Part I

Physics 

Professor Dr. Peter Nielaba 


Many important results have been achieved with the computer time granted at the
HLRS, and the projects are partly embedded in research clusters, e.g. “clusters of
excellence” or “Sonderforschungsbereiche”. The contributions in the proceedings
present the results of large-scale simulations for elementary particle models, for
systems on nano- and micrometer length scales, and for astrophysical phenomena;
they are summarized and commented on below.

O. Philipsen from the University of Frankfurt and Ph. de Forcrand from the ETH
Zurich and CERN (“muQCD”) have investigated the critical surface bounding the
region featuring chiral phase transitions in the quark mass and chemical potential
parameter space of quantum chromodynamics (QCD) with three flavors of quarks.
Their calculations are valid for small to moderate quark chemical potentials, $\mu \lesssim T$.
The authors study the situation for the limiting case of two light flavours and show 
first results on the nature of the chiral transition at zero chemical potential from 
extrapolations using imaginary chemical potential. 

For their Monte Carlo simulations the authors used the standard Wilson gauge 
and Kogut-Susskind fermion actions. Configurations are generated using the 
Rational Hybrid Monte Carlo (RHMC) algorithm. In order to investigate the critical 
behavior of the theory, the authors use the Binder cumulant as an observable. For 
each set of fixed quark mass and chemical potential, the critical coupling $\beta_c$ has
been interpolated from a range of typically three to five simulated $\beta$-values by
Ferrenberg-Swendsen reweighting. The simulations have been performed on the 
NEC-SX9 at the HLRS in Stuttgart. An estimate of the Binder cumulant for one set 
of mass values consisted of at least 200k trajectories, and the estimate of a critical 
point required at least 500k trajectories. 


Professor Dr. Peter Nielaba 

Fachbereich Physik, Universität Konstanz, 78457 Konstanz, Germany
e-mail: peter.nielaba@uni-konstanz.de

P. Baikov, K. Chetyrkin, J.H. Kühn, P. Marquard, M. Steinhauser and T. Ueda
from the KIT Karlsruhe (“ParFORM”) have investigated multi-loop Feynman integrals
with a parallel computer algebra program (“ParFORM”), using MPI on the XC4000.
The computation of Feynman integrals is important for determining quantum
corrections to physical processes. The code is under constant development, and the
authors give an overview of the present status.

A. Riefer, M. Rohrmüller, M. Landmann, S. Sanna, E. Rauls, U. Gerstmann, and
W.G. Schmidt from the University of Paderborn (“MolArch1”) have investigated
the electronic structure and optical response of 2-aminopyrimidine molecules by 
a combination of density functional theory and many-body perturbation theory. 
The calculations predict quasiparticle gaps, i.e., differences between the ionization
energies and electron affinities, of several eV for the molecules, and the
results indicate a near cancellation of the electronic self-energy and exciton binding
energies for the lowest excitations of 2-aminopyrimidines. The authors find a strong
influence of local-field effects as well as resonant-nonresonant coupling terms in the 
electron-hole Hamiltonian on the optical properties. 

The authors have used the Vienna Ab-initio Simulation Package (“VASP”) 
implementation of the gradient-corrected density functional theory (DFT-GGA) for 
their computations of the ground state and GWA calculations. The HSE06 hybrid 
functional has been used as well. For the electronic self-energy calculations applying
perturbation theory (“$G_0W_0$”) and Bethe-Salpeter type calculations the cell size has
been varied. The calculations within this project were performed on the CRAY XE6 
at the HLRS with good scaling properties. 

K. Binder, P. Virnau and A. Winkler from the University of Mainz (“ colloid ”) 
have investigated the spinodal decomposition of colloid-polymer mixtures between 
walls including hydrodynamic interactions by the multi particle collision dynamics 
and domain decomposition methods on Hermit. Polymers are described as soft
spheres weakly interacting with each other, while colloid-polymer and colloid-colloid
pairs interact with the (repulsive) Weeks-Chandler-Andersen potential, so
that a depletion attraction between colloids results, similar to the Asakura-Oosawa
model. Large system sizes and long simulation times are required for this study.
Interesting results on the effect of the confining geometry on the decomposition
dynamics have been achieved and will be studied further in the future. Important results
are the effects of hydrodynamics and of different boundary conditions on the 
growth exponents. The authors used the standard halo layer domain decomposition 
technique to treat the embedded particles and the parallelization approach proposed 
by Sutmann et al. for the solvent particles. This level of parallelization allowed 
the authors to use 1,024 cores (or more) to study successfully the phase separation 
kinetics in the Asakura-Oosawa model in huge systems (max. 1,100,000 MD 
particles and 52,000,000 solvent particles) over multiple scales of MD time. The 
authors find that for huge systems at the late stages of phase separation the net
cost of explicit solvent particles in the framework of MPCD is only about
10 % in comparison to standard Molecular Dynamics simulations.

S. Schmieschek, A. Narváez, and J. Harting from the University of Stuttgart
and the Eindhoven University of Technology (“icpsusp”) investigate
multi-component flows in porous media by lattice Boltzmann simulations on the XC2
cluster at the SSCK. The authors integrated a “multi relaxation time” collision
scheme with their pseudo-potential multiphase lattice Boltzmann implementation
“lb3d” for fluids with multiple components, and good scaling is achieved. The steps
taken to optimize the performance have made it possible to keep the increase in
calculation time close to the expected minimum. The authors show that for multi-component
systems the additional calculation time spent in the collision step is very small in
comparison to the time increase due to the calculation of the interaction forces. The
authors give an overview of the present status.

R. J. Geretshauser, F. Meru, K. Schaal, R. Speith, and W. Kley (University of 
Tübingen and ETH Zurich, “SPH-PPC”) have analyzed pre-planetesimal collisions
with their solid body smoothed particle hydrodynamics (SPH) code parasph. 
The main focus of the project is on the investigation of the growth conditions 
for macroscopic pre-planetesimals. By parameter studies the authors investigated 
fragmentation criteria in dust collisions depending on aggregate size and aggregate 
porosity, and they extended their previous study on bouncing criteria of equally 
sized aggregates depending on their porosity and the presence of compacted shells 
of various porosities. The authors derive fragmentation criteria for dust cylinders 
depending on angular velocity as well as porosity and perform corresponding 
simulations. 

The code parasph, used by the authors, is based on the “parasph” library,
featuring domain decomposition, load balancing, nearest neighbor search, and inter-node
communication. It has been extended for the simulation of ductile, brittle, and porous
media, by an implementation of a porosity model, and by SPH enhancements.
The parallel implementation utilizes the Message Passing Interface (MPI) library, 
and HDF5 was included as a compressed input and output file format with increased 
accuracy, decreasing the amount of required storage space. The simulations were 
carried out on the NEC Nehalem cluster of the HLRS with 240,143-476,476 SPH 
particles depending on the size of the projectile. Thirty-two to eighty cores were 
used, and simulations roughly took 72-240 h for 1 s of simulated time, depending 
on the size of the problem and the involved physical process. 

S.C.O. Glover, P.C. Clark and R.S. Klessen from the University of Heidelberg
(“EDuCool”) investigate the evolution of star-forming clouds for a wide range
of metallicities and the effects on the mass function of the fragments that form, 
on cooling and heating rates, and on the number of Bonnor-Ebert masses of the 
fragmenting clouds. The authors use a modified version of the Gadget-2 (SPH)
code; a typical simulation run is done with 40 million SPH particles and requires
130 kCPU-h on 256 or 512 CPUs. Interesting results have been obtained on the
thermodynamical evolution of gas and dust, the fragmentation and the properties of 
the fragments. 

F. Hanke, A. Marek, B. Müller, and H.-Th. Janka from the MPI for Astrophysics
in Garching (“SuperN”) investigate core-collapse supernova explosions of massive
stars. The authors have developed a fully MPI-OpenMP parallelized version of their 
VERTEX-PROMETHEUS code in order to perform three-dimensional simulations 
of stellar core-collapse and explosion. The simulations typically require 10 2 " 
floating point operations, and good scaling up to 32,000 cores has been achieved
on Hermit. 

K.D. Kokkotas, P. Lasky and B. Zink from the University of Tübingen (“Magnetar”)
have investigated the dynamic stability of strong magnetic fields inside highly
magnetized neutron stars by general relativistic magnetohydrodynamics simulations
with the Horizon code on GPUs. Parts of the simulations have been done on the
Nehalem cluster at the HLRS. The authors have varied the stiffness of the equation
of state and the rotation of the neutron star and were able to simulate the time
evolution for hundreds of milliseconds or even seconds. Their results are important
contributions to a Sonderforschungsbereich on “Gravitational Wave Astronomy”.


Constraints on the Two-Flavour QCD Phase 
Diagram from Imaginary Chemical Potential 


O. Philipsen and Ph. de Forcrand 


In a long term project, we calculate the critical surface bounding the region featuring 
chiral phase transitions in the quark mass and chemical potential parameter space of 
QCD with three flavours of quarks. Our calculations are valid for small to moderate
quark chemical potentials, $\mu \lesssim T$. The presence of tricritical lines at imaginary
chemical potential $\mu = i\pi T/3$, with known scaling behaviour in their vicinity, puts
constraints on this phase diagram. Here we undertake first steps to study the situation
for the limiting case of two light flavours. In this case, the nature of the chiral 
transition at zero chemical potential is not yet established. We show first results of 
our project to extract this behaviour from extrapolations using imaginary chemical 
potential. 


1 Introduction 

The fundamental theory describing the strong interactions is Quantum Chromodynamics
(QCD) with two light quark flavours, the u- and d-quarks, and a heavier
s-quark. Since the interaction weakens at asymptotically large energy scales, QCD
predicts at least three different forms of nuclear matter: the usual hadronic matter 
at low temperature and density, a quark gluon plasma at high temperature and low 
density, and colour superconducting nuclear matter at low temperatures and high 
density. Direct Monte Carlo simulations of the finite density QCD phase diagram 
are impossible because of the so-called sign problem, so that indirect methods need 


O. Philipsen

Institut für Theoretische Physik, Goethe-Universität Frankfurt, 60438 Frankfurt am Main,
Germany

e-mail: philipsen@th.physik.uni-frankfurt.de
Ph. de Forcrand

Institut für Theoretische Physik, ETH Zurich, CH-8093 Zurich, Switzerland
Physics Department, TH-Unit, CERN, CH-1211 Geneva, Switzerland


W.E. Nagel et al. (eds.), High Performance Computing in Science and Engineering ’12, 
DOI 10.1007/978-3-642-33374-3_1, © Springer-Verlag Berlin Heidelberg 2013

Fig. 1 Left: Schematic phase transition behaviour of $N_f = 2+1$ QCD for different choices of
quark masses $(m_{u,d}, m_s)$ at $\mu = 0$. Right: The same with chemical potential for quark number
as an additional parameter. The critical lines sweep out surfaces as $\mu$ is turned on. At imaginary
chemical potential $\mu = i\pi T/3$, the critical surfaces terminate in tricritical lines which determine
their curvature through critical scaling


to be employed, which work for small enough $\mu/T$ only (for an overview and
references, see [1]).

At zero chemical potential, the nature of the quark-hadron phase transition 
depends on the quark masses, as summarised in Fig. 1 . In the limits of zero and 
infinite quark masses, order parameters for the breaking of the global chiral and 
centre symmetry, respectively, can be defined, and one finds numerically that first 
order phase transitions take place at some finite temperature $T_c$. On the other hand,
for intermediate quark masses the transition is an analytic crossover. Hence, each 
corner of first order phase transitions is bounded by a second-order critical line as 
in Fig. 1. The physical quark masses are light, so our interest is in the lower left 
boundary, which is called the chiral critical line, as opposed to the deconfinement 
critical line in the heavy mass region. 

In previous work the location of the boundary line has been determined for the
case of degenerate quark masses, $N_f = 3$ [2, 3], where it was also shown that it
belongs to the 3d Ising, or 3d Z(2), universality class. On the lattice, temperature
and lattice spacing are related by $T = 1/(aN_t)$, i.e. larger $N_t$ corresponds to
finer lattices for a fixed physical temperature. We have used $N_t = 4$ lattices,
corresponding to a lattice spacing $a \sim 0.3$ fm, to map out how this line changes
(i) for $N_f = 3$ as a function of chemical potential $\mu$ [3, 4] and (ii) for $\mu = 0$ in the
case of non-degenerate quark masses $m_{u,d} \neq m_s$ [4]. It was found that the physical
point is located close to the boundary line on the crossover side.
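
To make the relation $T = 1/(aN_t)$ concrete, the following minimal sketch converts a lattice spacing in fm into a temperature in MeV. It assumes natural units restored with $\hbar c \approx 197.327$ MeV fm; the function names are illustrative and not taken from the authors' code.

```python
# Conversion constant hbar*c in MeV*fm (standard value, rounded).
HBAR_C_MEV_FM = 197.3269804


def lattice_temperature_mev(a_fm: float, n_t: int) -> float:
    """Temperature T = 1/(a * N_t), returned in MeV for a given in fm."""
    return HBAR_C_MEV_FM / (a_fm * n_t)


def lattice_spacing_fm(temperature_mev: float, n_t: int) -> float:
    """Lattice spacing a = 1/(T * N_t), returned in fm for T given in MeV."""
    return HBAR_C_MEV_FM / (temperature_mev * n_t)


if __name__ == "__main__":
    t = lattice_temperature_mev(0.3, 4)
    print(f"a = 0.3 fm, N_t = 4  ->  T = {t:.1f} MeV")
    # At the same physical temperature, doubling N_t halves the lattice spacing.
    print(f"same T, N_t = 8      ->  a = {lattice_spacing_fm(t, 8):.3f} fm")
```

With $a \sim 0.3$ fm and $N_t = 4$ this gives a temperature of roughly 164 MeV, and it illustrates why larger $N_t$ means a finer lattice at fixed physical temperature.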

When a chemical potential for the baryon density is switched on, the chiral
critical point of the $N_f = 3$ theory recedes as in Fig. 1 (right). The critical quark
mass marking the boundary line can be expanded as

$$\frac{m_c(\mu)}{m_c(0)} = 1 + c_1 \left(\frac{\mu}{\pi T}\right)^2 + c_2 \left(\frac{\mu}{\pi T}\right)^4 + \ldots \qquad (1)$$









Nf = 2 QCD at imaginary chemical potential 
with $c_1 = -3.3(3)$, $c_2 = -47(20)$ [5, 6]. The same behaviour is found for non-degenerate
quark masses. Tuning the s-quark mass to its physical value, we
calculated $m_c^{u,d}(\mu)$ with $c_1 = -39(8)$ and $c_2 < 0$ [7]. Similar behaviour is found for
heavy quarks. Hence, the critical lines sweep out surfaces, as shown in Fig. 1 (right).
Contrary to common expectations based on simplified models, the curvature of both
surfaces is such that the region of first order shrinks, i.e. the phase transitions weaken
when a real chemical potential is switched on. As a consequence, the physical point
remains in the crossover region and there is no chiral critical point in QCD for
moderate baryon densities.
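
The sign of the curvature can be read off by evaluating the truncated expansion (1) numerically. The sketch below is only an illustration: it uses the central values of the degenerate-mass coefficients quoted above, ignores their uncertainties and all higher-order terms, and takes $x = (\mu/(\pi T))^2$ as its argument (negative $x$ corresponds to imaginary $\mu$). The function name is hypothetical.

```python
# Central values of the N_f = 3 coefficients quoted in the text (errors ignored).
C1 = -3.3
C2 = -47.0


def critical_mass_ratio(x: float, c1: float = C1, c2: float = C2) -> float:
    """Truncated expansion m_c(mu)/m_c(0) = 1 + c1*x + c2*x**2,
    with x = (mu / (pi*T))**2; x < 0 describes imaginary mu."""
    return 1.0 + c1 * x + c2 * x * x


if __name__ == "__main__":
    for x in (-0.02, -0.01, 0.0, 0.01, 0.02):
        print(f"x = {x:+.2f}  ->  m_c(mu)/m_c(0) = {critical_mass_ratio(x):.4f}")
```

For positive $x$ (real $\mu$) the ratio drops below one, i.e. the critical mass decreases and the first-order region shrinks, which is the behaviour described in the paragraph above.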

In the current and future project, we explore in particular the implications for
the behaviour of QCD in the two-flavour chiral limit ($m_u = m_d = 0$, $m_s = \infty$). In
that limit, it is widely believed that QCD undergoes a finite-temperature, second-order
O(4) chiral transition at $\mu = 0$, which turns first-order at a tricritical point
for some real $\mu$. However, other possibilities exist. At $\mu = 0$ in particular, the
finite-temperature transition might be first-order. The present numerical evidence
is inconclusive: using Wilson fermions, O(4) scaling is preferred [8], while with
staggered fermions O(4) scaling has been elusive, and first-order behaviour has
also been claimed [9]. Note that behaviour consistent with O(4) has been seen with
improved staggered fermions, in an $N_f = 2+1$ setup where the strange quark mass
is fixed at its physical value [10]. Approaching the chiral limit from the imaginary $\mu$
direction, i.e. working in the back plane of Fig. 1 (right), offers a novel, independent
method to help settle the issue.


2 The Binder Cumulant and Universality 


In order to investigate the critical behaviour of the theory, we use the Binder
cumulant [11] as an observable. It is defined as

$$B_4(m, \mu) = \frac{\langle (\delta X)^4 \rangle}{\langle (\delta X)^2 \rangle^2}, \qquad (2)$$

with the fluctuation $\delta X = X - \langle X \rangle$ of the order parameter of interest. Since we
investigate the region of chiral phase transitions, we use the chiral condensate, $X = \bar{\psi}\psi$.
For the evaluation of the Binder cumulant it is implied that the lattice gauge
coupling has been tuned to its pseudo-critical value, $\beta = \beta_c(m, \mu)$, corresponding
to the phase boundary between the two phases. In the infinite volume limit the
Binder cumulant behaves discontinuously, assuming the values 1 in a first order
regime, 3 in a crossover regime and the critical value $\approx 1.604$ reflecting the 3d Ising
universality class at a chiral critical point. On a finite volume the discontinuities are
smeared out and $B_4$ passes continuously through the critical value. This is sketched
in Fig. 2. In the neighbourhood of the chiral critical point at zero chemical potential
it can be expanded linearly





Fig. 2 Schematic behaviour of the Binder cumulant as a function of quark mass for
$\mu = 0$. First order transitions and crossovers correspond to $B_4 = 1, 3$, respectively,
whereas a second order 3d Ising transition is characterised by $B_4 \approx 1.604$.
On finite volumes the step function gets smeared out
(plot: region label “First order”; horizontal axis $m - m_c$)


$$B_4(m, \mu) = A + B\,(am - am_c) + C\,(a\mu)^2 + \ldots, \qquad (3)$$

with $A \to 1.604$ for $V \to \infty$.
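
The two limiting values of the Binder cumulant quoted above can be checked on toy data. The following minimal sketch estimates $B_4 = \langle(\delta X)^4\rangle / \langle(\delta X)^2\rangle^2$ from a list of measurements; the synthetic ensembles (a symmetric two-state distribution standing in for a first-order coexistence, and a Gaussian standing in for a crossover) are assumptions chosen for illustration and are not simulation data.

```python
import random


def binder_cumulant(samples):
    """Estimate B_4 = <(dX)^4> / <(dX)^2>^2 with dX = X - <X>."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 * m2)


if __name__ == "__main__":
    # Two equally weighted sharp peaks mimic phase coexistence: B_4 = 1.
    two_state = [-1.0, 1.0] * 500
    print("two-state ensemble:", binder_cumulant(two_state))

    # A single Gaussian peak mimics a crossover: B_4 is close to 3.
    rng = random.Random(0)
    gaussian = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
    print("Gaussian ensemble: ", round(binder_cumulant(gaussian), 3))
```

In an actual analysis the samples would be measurements of the chiral condensate at the pseudo-critical coupling, and statistical errors would have to be estimated, for instance with a jackknife or bootstrap procedure.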


3 Continuation of the Critical Surfaces to Imaginary $\mu$

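Let us consider imaginary chemical potential, $\mu = i\mu_i$. The QCD partition function
exhibits two important exact symmetries, reflection symmetry in $\mu$ and Z(3)-periodicity
in $\mu_i$, which hold for quarks of any mass [12],

$$Z(\mu) = Z(-\mu), \qquad Z\left(\frac{\mu}{T}\right) = Z\left(\frac{\mu}{T} + i\,\frac{2\pi n}{3}\right), \qquad (4)$$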


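for general complex values of $\mu$. The symmetries imply transitions between adjacent
centre sectors of the theory at fixed $\mu_i^c = (2n+1)\pi T/3$, $n = 0, \pm 1, \pm 2, \ldots$. The
Z(3)-sectors are distinguished by the Polyakov loop

$$L(\mathbf{x}) = \frac{1}{3}\,\mathrm{Tr} \prod_{\tau=1}^{N_t} U_0(\mathbf{x}, \tau) = |L|\, e^{-i\varphi}, \qquad (5)$$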


whose phase $\varphi$ cycles through $\langle \varphi \rangle = n(2\pi/3)$, $n = 0, 1, 2, \ldots$ as the different
sectors are traversed. Moreover, the above also implies reflection symmetry about
the Z(3) phase boundaries, $Z(\mu_i^c + \mu_i) = Z(\mu_i^c - \mu_i)$.
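
The sector structure described above can be tabulated with a few lines of code. This is a minimal sketch with hypothetical function names: it lists the boundaries $\mu_i^c/T = (2n+1)\pi/3$ and assigns each value of $\mu_i/T$ to a centre sector under the assumption that sector $n$ is centred at $\mu_i/T = 2\pi n/3$ (so that the sector containing $\mu_i = 0$ has index 0). The direction in which the Polyakov-loop phase advances is convention dependent.

```python
import math

SECTOR_WIDTH = 2.0 * math.pi / 3.0  # Z(3) period in mu_i / T


def rw_transition_points(n_min: int, n_max: int):
    """Boundaries mu_i^c / T = (2n + 1) * pi / 3 for n_min <= n <= n_max."""
    return [(2 * n + 1) * math.pi / 3.0 for n in range(n_min, n_max + 1)]


def centre_sector(mu_i_over_t: float) -> int:
    """Index n in {0, 1, 2} of the centre sector containing mu_i / T.

    Assumes sector n is centred at mu_i / T = 2*pi*n/3; the expected
    Polyakov-loop phase is then n * 2*pi/3 up to the sign convention.
    """
    return round(mu_i_over_t / SECTOR_WIDTH) % 3


if __name__ == "__main__":
    print("boundaries:", [round(b, 4) for b in rw_transition_points(-1, 1)])
    for value in (0.0, 1.2, 2.5, 4.0, 2.0 * math.pi):
        print(f"mu_i/T = {value:.3f}  ->  sector {centre_sector(value)}")
```

The Z(3)-periodicity shows up as the sector index repeating after a shift of $\mu_i/T$ by $2\pi$, i.e. after three sectors.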

Transitions in $\mu_i$ between neighbouring sectors are of first order for high T
and analytic crossovers for low T [12-14], as shown in Fig. 3 (left). The order
parameter is the shifted phase of the Polyakov loop, $\phi = \varphi - \mu_i/T$. Away from
$\mu_i = \mu_i^c$, there is a chiral or deconfinement transition line separating high and






Fig. 3 Left: Phase diagram for imaginary $\mu$. Vertical lines are first order transitions between Z(3)-
sectors, arrows show the phase of the Polyakov loop. The $\mu = 0$ chiral/deconfinement transition
continues to imaginary $\mu$; its order depends on $N_f$ and the quark masses. Right: Schematic phase
diagram of the Roberge-Weiss endpoint


low temperature regions. This line represents the analytic continuation of the chiral 
or deconfinement transition at real $\mu$. Its nature (first, second order or crossover)
depends on the number of quark flavours and masses. We have investigated the
nature of the junction of those lines for $N_f = 3$ [15]; similar investigations have
been carried out for $N_f = 2$ [16]. The combined outcome is shown in Fig. 3 (right),
which corresponds to the bottom plane of Fig. 1 (right). The Roberge-Weiss point is
of first order for heavy and light quarks, and of second order for intermediate mass
quarks. These regimes are separated by tricritical lines. Four points, corresponding
to the $N_f = 3$ [15] and $N_f = 2$ boundary points, have been explicitly computed.
The most natural generalisation to non-degenerate quark masses is that they are 
continuously connected by tricritical lines. 

Figure 1 (right) represents the connection between the $\mu = 0$ and $\mu = i\pi T/3$ phase
diagrams. In the vicinity of a tricritical point, scaling laws apply. In our case, the
scaling exponents governing the behaviour near the tricritical point are mean-field.
This is shown for the example of heavy quarks, where

$$\left[\left(\frac{\mu}{T}\right)^2 + \left(\frac{\pi}{3}\right)^2\right] \propto \left(m_{u,d} - m_{\mathrm{tric}}\right)^{5/2}. \qquad (6)$$

Figure 4 shows the corresponding behaviour of the critical heavy quark mass as a 
function of chemical potential. 
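
Solving the scaling relation (6) for the critical mass gives $m_c = m_{\mathrm{tric}} + K\,[(\mu/T)^2 + (\pi/3)^2]^{2/5}$. The sketch below evaluates this curve; the amplitude $K$ and the value of $m_{\mathrm{tric}}$ are hypothetical placeholders (in practice they would be fitted to simulation data), and the function name is an illustration only.

```python
import math


def tricritical_scaling_mass(mu_over_t_sq: float, m_tric: float, k: float) -> float:
    """Critical mass from mean-field tricritical scaling,
    m_c = m_tric + k * [(mu/T)^2 + (pi/3)^2]^(2/5).

    mu_over_t_sq is (mu/T)^2 and is negative for imaginary mu; the curve
    starts at the Roberge-Weiss value (mu/T)^2 = -(pi/3)^2.
    """
    x = mu_over_t_sq + (math.pi / 3.0) ** 2
    if x < 0.0:
        raise ValueError("(mu/T)^2 lies beyond the Roberge-Weiss value")
    return m_tric + k * x ** 0.4


if __name__ == "__main__":
    M_TRIC, K = 1.0, 2.0  # hypothetical values, for illustration only
    for mu2 in (-(math.pi / 3.0) ** 2, -0.5, 0.0, 1.0):
        m_c = tricritical_scaling_mass(mu2, M_TRIC, K)
        print(f"(mu/T)^2 = {mu2:+.3f}  ->  m_c = {m_c:.4f}")
```

Because the exponent 2/5 is smaller than one, the curve leaves the tricritical point with infinite slope, which is why the scaling form can constrain the critical surface far away from the Roberge-Weiss plane.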

Note that the $N_f = 2$ (i.e. $m_s = \infty$) “backplane” contains two tricritical points
on the chiral critical surface: one in the Roberge-Weiss plane, the other on the
$m_{u,d} = 0$ vertical axis. The location of the latter is related to the value of the
tricritical strange quark mass. They should again be connected by critical lines
related to tricritical scaling.

Fig. 4 For heavy quarks, tricritical scaling in the vicinity of the Roberge-Weiss imaginary-$\mu$ value
extends far into the region of real $\mu$ [15]


Fig. 5 Binder cumulants for the lightest fermion mass, $am = 0.0025$, determining the critical
point


4 Preliminary Nf = 2 Results 

Following the above discussion, we joined forces with an Italian collaboration and 
performed simulations of $N_f = 2$ QCD. We used staggered quarks of masses
$am_q = 0.01$ and 0.005 on $N_t = 4$ lattices, scanning in $(\mu/T)^2$ to determine the
value of imaginary $\mu$ corresponding to a second-order transition. Our observable is
the Binder cumulant of the quark condensate. For an example at the smallest quark
mass, see Fig. 5. Consistent results are obtained from the finite-size scaling of the





















Fig. 6 Critical line joining the known tricritical point on the horizontal axis, and
extrapolated to the unknown tricritical point on the vertical axis
(plot label: (mu_tric(m=0)/T)^2 = 0.3)



plaquette distribution. So far we have obtained four critical points, shown in Fig. 6. It seems
impossible to smoothly match two tricritical scaling curves passing through these
points. Additional masses are needed to determine the critical curve. Nevertheless,
assuming convexity of the critical curve already constrains the $m_{u,d} = 0$ tricritical
point to lie at $(\mu/T)^2 > -0.3$. The figure illustrates the case where this point lies at
$\mu = 0$. It might also lie at $(\mu/T)^2 > 0$, so that the $\mu = 0$ chiral transition would
be first-order. Preliminary results have been reported in [17]. Additional small-mass
measurements are underway and will settle this issue.


5 Simulation Details 

For our Monte Carlo simulations we use the standard Wilson gauge and Kogut- 
Susskind fermion actions. Configurations are generated using the Rational Hybrid 
Monte Carlo algorithm [18]. Our numerical procedure to compute the Binder 
cumulant is as follows. For each set of fixed quark mass and chemical potential, we
interpolate the critical coupling $\beta_c$ from a range of typically three to five simulated
$\beta$-values by Ferrenberg-Swendsen reweighting [19]. For each simulation point
$\sim$50k RHMC trajectories have been accumulated, measuring the gauge action, the
Polyakov loop and up to four powers of the chiral condensate after each trajectory.
Thus, the estimate of $B_4$ for one set of mass values consists of at least 200k
trajectories, and the estimate of a critical point at least 500k trajectories.
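
The idea behind the reweighting step can be sketched as follows. This is a minimal single-histogram version, not the authors' implementation: it assumes that the ensemble at coupling $\beta$ is sampled with weight $\exp(-\beta E)$, where $E$ is the measured gauge action divided by $\beta$, and it reweights measurements from one simulated coupling to a nearby target coupling. Combining several simulated $\beta$-values, as described in the text, requires the multi-histogram generalisation, which is not shown. All names are hypothetical.

```python
import math


def reweight(observable, energy, beta_sim: float, beta_target: float) -> float:
    """Single-histogram reweighting of <O> from beta_sim to beta_target.

    Assumes configurations were sampled with weight exp(-beta_sim * E), so
    each measurement receives the factor exp(-(beta_target - beta_sim) * E).
    """
    delta = beta_target - beta_sim
    log_w = [-delta * e for e in energy]
    shift = max(log_w)  # subtract the maximum to avoid overflow in exp()
    weights = [math.exp(lw - shift) for lw in log_w]
    norm = sum(weights)
    return sum(w * o for w, o in zip(weights, observable)) / norm


if __name__ == "__main__":
    # Toy data: two configurations with E = 0 and E = 1, observable O = E.
    obs = [0.0, 1.0]
    ene = [0.0, 1.0]
    print("no shift in beta:  ", reweight(obs, ene, 1.0, 1.0))
    print("beta raised by ln3:", reweight(obs, ene, 0.0, math.log(3.0)))
```

Reweighting the moments of the chiral condensate in this way yields $B_4$ as a continuous function of $\beta$ near the simulated points, from which the pseudo-critical coupling can be located. The method is only reliable while the target coupling stays close enough to the simulated one for the energy histograms to overlap.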

The simulations are performed on the NEC-SX9 at the HLRS in Stuttgart. A scan 
in parameter space involves simulations of many parameter sets. For such a problem, 
parallelisation is achieved trivially by running one set of couplings per node, each 
node running in vector mode. This way of parallelising allows us to explore large
regions of the parameter space at the same time, which is necessary when mapping
out a phase diagram. Moreover, there is no overhead for parallelisation and
communication, ensuring maximal computing efficiency and one-to-one scaling of
compute power with the number of processors. The vector mode ensures maximal
throughput for each individual lattice. Typically we work with several nodes at once,
using each of their eight cores in parallel and avoiding communication between
the nodes.


Parallel Computer Algebra and Feynman
Integrals 


P.A. Baikov, K.G. Chetyrkin, J.H. Kühn, P. Marquard, M. Steinhauser,
and T. Ueda 


1 Introduction 

Relativistic quantum field theories are the mathematical tools which are indispensable
in modern particle physics. In particular, in combination with perturbation
theory, realized with the help of so-called Feynman diagrams, it is possible to make
predictions which can be confronted with experiment.

The main motivation of the current project is to study multi-loop Feynman 
integrals, the mathematical expressions of the Feynman diagrams, which have to be 
computed when considering quantum corrections to physical processes. From the 
mathematical point of view a Feynman integral corresponds to a multi-dimensional 
integral over the time and space components of momenta which are defined in a 
complex space-time dimension d. The integrand consists of rational functions of 
scalar products formed by all d -dimensional momenta involved in the problem. 

There are several technical challenges one has to master in the course of such
calculations. Among the most demanding ones is the processing of the huge amounts of
data that arise in intermediate steps. For this reason, parallel versions of the computer
algebra program Form [1] were developed some years ago; they are still being
refined and optimized in order to solve ever more complex tasks.
The main advantage of FORM is the ability to process huge expressions (basically 
only limited by disk space) in a fast and effective way. 

There are two parallel versions of Form: ParForm [2] and TForm [3].
Whereas the parallelization of TForm is realized with threads, which restricts

Fig. 1 Run time and speed-up for the benchmark program BAICER (N = 16) on XC4000 as a
function of the number of CPUs. For comparison, we also plot the result on ttpmoon, a
computer cluster in our institute


TForm to multi-core computers, ParForm uses MPI (“Message Passing Interface”)
for the communication among the various processors. Thus ParForm can
also be used on computer clusters with a fast interconnect, e.g. InfiniBand. Due
to the structure of the XC4000 cluster, which mostly consists of four-core nodes,
essentially only ParForm can be used for calculations.

Let us mention that since August 2010 all versions of FORM are open source and
can be downloaded from [4].

In the past we have used the cluster XC4000 (and its predecessor XC6000) for 
various calculations, running up to ten jobs in parallel. In April 2011 we were,
however, asked to reduce the number of jobs to at most two since the performance 
of the whole system is significantly reduced if ParForm starts writing or reading 
the intermediate results to or from the hard disk. This makes the use of XC4000 
quite unattractive. For this reason we have not submitted any production job in the 
past 12 months but have used XC4000 essentially to further develop ParForm and 
to benchmark against other platforms. 


2 ParForm Running on XC4000 

In this section we update the performance tests presented in the report from June
2010. Figure 1 compares the run time and speed-up for a typical benchmark job as
a function of the number of CPUs used. The red (circles) and blue (squares) lines
both correspond to calculations performed on XC4000. Whereas the local disks have
been used for the red data, the blue data have been obtained by using global disks.
The former show a very poor performance (of at most 60 MB/s) and thus reach a
speed-up of 5 only for about 15 CPUs. On the other hand, the use of the global disks
leads to good results (with a theoretical peak performance of 320 and 400 MB/s for
reading and writing, respectively) up to about 8-12 CPUs, which is what we
usually use for the production jobs. Of course, the use of the global hard disks
explains the performance reduction of the system.

The green curve (triangles) has been obtained by running the benchmark jobs on
a cluster with 12-core nodes interconnected via QDR InfiniBand (with about 1 GB/s
read/write performance). As can be seen from Fig. 1, up to about ten cores the green
and blue curves are identical. For more CPUs, however, the green curve shows a
significantly better behaviour, reaching a speed-up of 10 for about 16 cores and 15 for
30 cores.
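
The speed-up figures quoted in this section can be turned into parallel efficiencies with a short calculation. The sketch below is an illustration added for orientation, not part of the authors' benchmark: it computes the efficiency $S/N$ and, as a rough diagnostic, the experimentally determined serial fraction of the Karp-Flatt metric, $e = (1/S - 1/N)/(1 - 1/N)$. The core counts are the approximate values given in the text.

```python
def parallel_efficiency(speed_up: float, n_cpus: int) -> float:
    """Efficiency E = S / N of a run with speed-up S on N CPUs."""
    return speed_up / n_cpus


def karp_flatt(speed_up: float, n_cpus: int) -> float:
    """Karp-Flatt serial fraction e = (1/S - 1/N) / (1 - 1/N), for N > 1."""
    return (1.0 / speed_up - 1.0 / n_cpus) / (1.0 - 1.0 / n_cpus)


if __name__ == "__main__":
    # (label, speed-up, number of CPUs), approximate values from the text.
    runs = [
        ("XC4000, local disks", 5.0, 15),
        ("12-core-node cluster", 10.0, 16),
        ("12-core-node cluster", 15.0, 30),
    ]
    for label, s, n in runs:
        print(f"{label:22s} N = {n:2d}  S = {s:4.1f}  "
              f"E = {parallel_efficiency(s, n):.3f}  e = {karp_flatt(s, n):.3f}")
```

For the run on local disks the efficiency is only about one third, whereas the InfiniBand cluster retains an efficiency of 0.5 or better up to 30 cores, in line with the qualitative statements above.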


3 Form 4 and ParForm 

The serial version of Form is under constant development so that it can
be applied to a broad class of problems; this recently resulted in version 4 [5].
Of course, modifications of Form also require the adaptation of the parallel versions.
In the last year the main emphasis has been on implementing the new features of
version 4 in ParForm.

In Form version 4 two essentially new features, which have been requested by
many users over recent years, are introduced: the factorization of expressions
and the ability to work with rational polynomial functions as coefficients of terms. 
They are based on newly written routines for (multivariate) polynomial operations 
(additions, subtractions, greatest common divisors, factorizations, etc.), for which 
Form uses a number of well-known algorithms. 

All new features added in version 4 are also available in ParForm. Although 
Form and ParForm share most of the source code, some additional code was 
needed to synchronize data between the master process and all worker processes. 
An example is the factorization of $-variables, which are small expressions stored 
in memory. When factorizing a $-variable, each factor is stored in an additionally 
allocated buffer. Therefore the routine for synchronization of $-variables between 
the master and workers was extended such that all factors are also synchronized via 
MPI if the $-variable has been factorized. 


Electronic and Optical Excitations
of Aminopyrimidine Molecules from 
Many-Body Perturbation Theory 


A. Riefer, M. Rohrmüller, M. Landmann, S. Sanna, E. Rauls, U. Gerstmann,
and W.G. Schmidt 


Abstract Calculations based on (occupation constrained) density functional 
theory using local as well as hybrid functionals to describe the electron- 
electron exchange and correlation are combined with many-body perturbation 
theory in order to determine the electronic and optical excitation properties of 
5-(pentafluorophenyl)pyrimidin-2-amine, 5-(4-methoxy-2,3,5,6-tetrafluorophenyl)- 
pyrimidin-2-amine, and 5-(4-(dimethylamino)-2,3,5,6-tetrafluorophenyl)pyrimidin- 
2-amine. Large quasiparticle shifts and exciton binding energies of about 4 eV
are found. They cancel each other partially and thus allow for a meaningful 
description of the molecular optical response within the independent-particle 
approximation. We find a surprisingly strong influence of local-field effects as 
well as resonant-nonresonant coupling terms in the electron-hole Hamiltonian on 
the optical properties. 


1 Introduction 

Organic semiconductors are important materials for various applications due to their 
low-cost fabrication processes and the possibility to fine-tune desired functions by 
chemical modification of their building blocks. While recent years have seen 
tremendous progress in the understanding of the excitation properties of inorganic 
semiconductors, fueled in part by the availability of advanced computational 
schemes for electronic structure and optical response calculations such as the GW 
approximation (GWA) for obtaining accurate electronic quasiparticle energies and 
the Bethe-Salpeter approach (BSE) to calculate electron-hole interaction effects 
[1-6], far less is known about the electronic and optical properties of organic 
crystals. 

Fig. 1 Schematic model of 5-(pentafluorophenyl)pyrimidin-2-amine (FAP), 
5-(4-methoxy-2,3,5,6-tetrafluorophenyl)pyrimidin-2-amine (OFAP), and 
5-(4-(dimethylamino)-2,3,5,6-tetrafluorophenyl)pyrimidin-2-amine (NFAP) (from left 
to right). Dark (red), light (yellow), gray, light gray and small balls indicate O, C, N, 
F and H atoms, respectively 

Recently, a novel class of organic electronic material has been synthesized by 
the self-assembly and silver(I) complex formation of 2-aminopyrimidines [7]. The 
compounds were structurally as well as optically characterized [8] and it was 
found that the solid state absorption differs remarkably from the parent compound 
2-aminopyrimidine. The optical properties could be tuned by changing the silver 
counterion or by the reversible solvent extrusion and interchange. Furthermore, the 
electrical conductivity of the material was proven for a thin crystalline film. 

In order to gain a better understanding of the excitation properties of this class 
of systems, we first study molecular excitations in the respective parent molecules. 
In detail, we present first-principles calculations on the electronic and optical 
properties of 5-(pentafluorophenyl)pyrimidin-2-amine (FAP), 5-(4-methoxy- 
2,3,5,6-tetrafluorophenyl)pyrimidin-2-amine (OFAP), and 5-(4-(dimethylamino)- 
2,3,5,6-tetrafluorophenyl)pyrimidin-2-amine (NFAP) in order to clarify the impact 
of many-body effects and chemical trends. The three aminopyrimidine molecules 
(APM) are shown in Fig. 1. They consist of 22 (FAP), 26 (OFAP) and 30 (NFAP) 
atoms forming a 2-aminopyrimidine ring (atoms 1-8 in Fig. 1) and a (per)fluorinated 
phenyl ring (atoms 12-17), where position no. 22 is either a fluorine atom (FAP), 
a methoxy group (OFAP) or an amino group (NFAP). 


2 Methodology 

Ground-state and GWA calculations are performed using the Vienna Ab-initio 
Simulation Package (VASP) implementation [9] of the gradient-corrected [10] 
density functional theory (DFT-GGA). In addition, the hybrid functional due to 
Heyd, Scuseria, and Ernzerhof (HSE06) [11] was used. The electron-ion interaction 
is described by the projector augmented-wave (PAW) method [12,13]. We expand 
the valence wave functions into plane waves up to an energy cutoff of 400 eV. DFT 
calculations for single molecules were performed using a 14 × 15 × 20 Å³ supercell 
and Γ-point sampling for the Brillouin zone (BZ) integration. Test calculations show 
that the eigenenergies are converged within a few hundredths of an eV. For electronic 
self-energy calculations applying perturbation theory (G0W0) and Bethe-Salpeter 
type calculations (see, e.g., Ref. [5]) as well as for calculations of charged molecules 
the cell size was varied as described below. 

DFT calculations are known to often considerably underestimate electronic 
excitation energies [4]. Reliable quasiparticle gaps, exciton pair energies and Stokes 
shifts, however, can be obtained from occupation-constrained DFT (or ΔSCF) 
methods, cf. Refs. [14-16]. Thereby the quasiparticle (QP) gap is obtained directly 
as the difference between the ionization energy and the electron affinity 

$$E_g^{QP} = E(N+1, R) + E(N-1, R) - 2E(N, R), \qquad (1)$$

where $E(N, R)$, $E(N+1, R)$, and $E(N-1, R)$ represent the energy of an $N$, $N+1$, 
and $N-1$ electron system, respectively, with the equilibrium geometry $R$ of the 
$N$ electron system. The energy of the lowest excitonic excitation, corresponding 
to the situation that one electron occupies the lowest unoccupied molecular orbital 
(LUMO) leaving a hole behind in the highest occupied molecular orbital (HOMO), 
is given by 

$$E_{ex} = E(e\text{-}h, R) - E(N, R), \qquad (2)$$

where $E(e\text{-}h, R)$ is the total energy of the system in the presence of the 
electron-hole pair with fixed geometry $R$. Alternatively, as can be derived from 
Janak's theorem (see Ref. [14]), the energy of the lowest optical excitation can be 
obtained from the difference of the eigenenergies of the half-occupied HOMO, 
$\varepsilon_{H,0.5}$, and LUMO, $\varepsilon_{L,0.5}$, respectively, 

$$E_{ex} \approx E_{ex}^{J} = \varepsilon_{L,0.5} - \varepsilon_{H,0.5}. \qquad (3)$$

Relaxing the atomic coordinates to the geometry $R^*$ for fixed occupation numbers 
yields the lowest emission energy 

$$E_{em} = E(e\text{-}h, R^*) - E(N, R^*), \qquad (4)$$

which can be used to calculate the Stokes shift 

$$\Delta_S = E_{ex} - E_{em}. \qquad (5)$$

From calculations of the ground-state energy for different cell sizes one can 
conclude an error of 0.1 eV for the ΔSCF values. The QP gaps $E_g^{QP}$ are compared 
to the gap $E_g^{G_0W_0}$ that has been obtained from the G0W0 approximation of the 
electronic self-energy by postprocessing the PW91 wave functions 
and eigenvalues. The implementation details are given in Ref. [17]. 
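
The way Eqs. (1), (2), (4) and (5) combine total energies can be sketched in a few lines of Python. This is only an illustration: the class and function names are hypothetical, and the total energies in the example are invented placeholders, not values from the present calculations.

```python
from dataclasses import dataclass


@dataclass
class TotalEnergies:
    """Total energies in eV (all field names are illustrative)."""
    e_n: float           # E(N, R): neutral ground state at geometry R
    e_n_plus: float      # E(N+1, R): extra electron, geometry R kept fixed
    e_n_minus: float     # E(N-1, R): one electron removed, geometry R kept fixed
    e_eh: float          # E(e-h, R): constrained electron-hole pair at R
    e_n_relaxed: float   # E(N, R*): ground-state occupation at geometry R*
    e_eh_relaxed: float  # E(e-h, R*): electron-hole pair at relaxed geometry R*


def quasiparticle_gap(t: TotalEnergies) -> float:
    """Eq. (1): ionization energy minus electron affinity."""
    return t.e_n_plus + t.e_n_minus - 2.0 * t.e_n


def excitation_energy(t: TotalEnergies) -> float:
    """Eq. (2): lowest vertical electron-hole excitation."""
    return t.e_eh - t.e_n


def emission_energy(t: TotalEnergies) -> float:
    """Eq. (4): lowest emission energy at the relaxed geometry R*."""
    return t.e_eh_relaxed - t.e_n_relaxed


def stokes_shift(t: TotalEnergies) -> float:
    """Eq. (5): difference between excitation and emission energy."""
    return excitation_energy(t) - emission_energy(t)


# Invented placeholder energies, chosen only to exercise the formulas.
example = TotalEnergies(
    e_n=-100.0, e_n_plus=-99.0, e_n_minus=-94.0,
    e_eh=-96.5, e_n_relaxed=-99.4, e_eh_relaxed=-97.4,
)
```

Keeping the six total energies in one record makes explicit that the gap and the vertical excitation share the geometry R, whereas the emission energy requires two further calculations at R*.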

For systems where the electronic states have either the occupancy 0 for conduction 
states, n = c, or 1 for valence states, n = v, one obtains the dielectric tensor in 
independent-particle approximation (IPA) [18-20] 


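$$\varepsilon_{ii}(\hbar\omega) = 1 + \frac{4\pi e^2}{\Omega} \lim_{|\mathbf{q}_i| \to 0} \frac{1}{|\mathbf{q}_i|^2} \sum_{\mathbf{k},c,v} \frac{\langle u_{c\mathbf{k}+\mathbf{q}_i} | u_{v\mathbf{k}} \rangle \, \langle u_{c\mathbf{k}+\mathbf{q}_i} | u_{v\mathbf{k}} \rangle^{*}}{\varepsilon_c(\mathbf{k}) - \varepsilon_v(\mathbf{k}) - (\hbar\omega + i\eta)}, \qquad (6)$$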


where the sum is to be taken over the first BZ, $\mathbf{q}_i$ is the reciprocal vector in the 
cartesian direction $i$, $u_{n\mathbf{k}}$ are the periodic parts of the Bloch wave functions, 
$\varepsilon_n(\mathbf{k})$ the respective eigenenergies, $\Omega$ is the crystal volume, and $\eta$ is the 
broadening. In order to allow for comparison with the experimental data we average 
over the three cartesian directions 

$$\varepsilon(\hbar\omega) = \frac{1}{3} \sum_{i=x,y,z} \varepsilon_{ii}(\hbar\omega). \qquad (7)$$

The dielectric function within the IPA or by solving the BSE is based on the 
electronic structure as obtained from either the PW91/HSE06 calculations (partially 
with scissors shifted eigenvalues) or from the GWA. 
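
A minimal sketch of how such an independent-particle spectrum can be assembled is given below. It assumes that the prefactor and the squared Bloch overlaps of Eq. (6) have already been lumped into one non-negative weight per transition and cartesian direction; the function name, the `prefactor` argument and the transition list are hypothetical and serve only to illustrate the energy denominator of Eq. (6) and the directional average of Eq. (7).

```python
def epsilon_ipa(photon_energy, transitions, eta=0.10, prefactor=1.0):
    """Direction-averaged complex dielectric function at one photon energy (eV).

    transitions: list of (delta_e, (w_x, w_y, w_z)), where delta_e is the
    eigenvalue difference e_c(k) - e_v(k) and w_i is an assumed, precomputed
    weight standing in for the squared overlap in direction i.
    """
    per_direction = []
    for i in range(3):
        total = 0.0
        for delta_e, weights in transitions:
            # Energy denominator of Eq. (6) with broadening eta.
            total += weights[i] / (delta_e - (photon_energy + 1j * eta))
        per_direction.append(1.0 + prefactor * total)
    # Eq. (7): average over the three cartesian directions.
    return sum(per_direction) / 3.0


# Hypothetical transitions (energies in eV, weights in arbitrary units).
demo_transitions = [(4.0, (0.3, 0.1, 0.0)), (5.2, (0.0, 0.2, 0.4))]
photon_energies = [2.0 + 0.05 * n for n in range(81)]
imaginary_part = [epsilon_ipa(e, demo_transitions).imag for e in photon_energies]
```

With this sign convention each transition contributes a Lorentzian of half-width η to the imaginary part, centered at its eigenvalue difference.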

Solving the Bethe-Salpeter equation includes the electron-hole attraction and 
local-field effects in the dielectric function. For practical calculations, the BSE 
is transformed into a two-particle Schrödinger equation. Neglecting dynamical 
screening and umklapp processes, the resonant part of the exciton Hamiltonian 
(Tamm-Dancoff-Approximation, TDA, cf. Ref. [21]) for direct transitions and spin- 
singlets can be calculated in reciprocal space according to 


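$$H_{vc\mathbf{k},v'c'\mathbf{k}'} = \left( \varepsilon_c^{QP}(\mathbf{k}) - \varepsilon_v^{QP}(\mathbf{k}) \right) \delta_{vv'} \delta_{cc'} \delta_{\mathbf{k}\mathbf{k}'} + \frac{4\pi}{\Omega} \sum_{\mathbf{G},\mathbf{G}'} \left\{ 2\, \frac{\delta_{\mathbf{G}\mathbf{G}'} (1 - \delta_{\mathbf{G}0})}{|\mathbf{G}|^2}\, B_{cv}^{\mathbf{k}\mathbf{k}}(\mathbf{G})\, B_{c'v'}^{\mathbf{k}'\mathbf{k}'*}(\mathbf{G}) - \frac{\varepsilon^{-1}(\mathbf{k}-\mathbf{k}'+\mathbf{G}, \mathbf{k}-\mathbf{k}'+\mathbf{G}', 0)}{|\mathbf{k}-\mathbf{k}'+\mathbf{G}|^2}\, B_{cc'}^{\mathbf{k}\mathbf{k}'}(\mathbf{G})\, B_{vv'}^{\mathbf{k}\mathbf{k}'*}(\mathbf{G}') \right\}, \qquad (8)$$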


where the Bloch integral 

$$B_{nn'}^{\mathbf{k}\mathbf{k}'}(\mathbf{G}) = \int d\mathbf{r}\, u_{n\mathbf{k}}^{*}(\mathbf{r})\, e^{i\mathbf{G}\mathbf{r}}\, u_{n'\mathbf{k}'}(\mathbf{r}) \qquad (9)$$

over the periodic parts $u$ of the Bloch wave functions has been introduced. In 
the actual calculations we replace the inverse dielectric matrix $\varepsilon^{-1}$ by a diagonal 
model dielectric function suggested by Bechstedt et al. [22]. It depends on the static 
dielectric constant $\varepsilon_\infty$ and reduces the computational effort substantially. In the 
case of inorganic semiconductors [23,24], molecular crystals [16,25] and even surfaces 
[26], the application of this model dielectric function leads to rather accurate results. 
This is related to the fact that the model dielectric function depends on the local 
charge density and therefore carries some information about the local screening. For 
molecule calculations, the correct choice of $\varepsilon_\infty$ is difficult. The authors of Ref. [27] 
defined an effective volume $\Omega_{eff}$ where the screening takes place in order to address 
this problem in their optical response calculation of poly-para-phenylenevinylene. 
In our work we use $\varepsilon_\infty = 1$ for molecular calculations, which marks the lower limit 
for the screening interaction. If one assumes $\Omega_{eff} = 18^3$ Å³, the IPA calculations for 
FAP result in $\varepsilon_\infty = 1.05$, which leads to a blueshift of the excitonic eigenvalues 
by about 0.3 eV. Calculations for further values of $\varepsilon_\infty$ indicate a nearly linear 
dependence of the exciton binding energies on the screening, as may be expected. 
The dimension of the exciton Hamiltonian (Eq. 8) is determined by the size of 
the energy window for conduction and valence states. The spectra are calculated 
including either all states satisfying $\varepsilon_c(\mathbf{k}) - \varepsilon_v(\mathbf{k}) < 6$ eV (DFT) or the lowest 96 
states (GWA). For the actual calculation of the spectra we use the time-evolution 
algorithm proposed by one of the present authors [26]. In addition to BSE-TDA, 
calculations with the full exciton Hamiltonian were also performed (BSE). For 
the comparison with measured optical spectra we use the real and imaginary parts of 
the dielectric function, $\varepsilon'(\hbar\omega)$ and $\varepsilon''(\hbar\omega)$, respectively, to obtain the attenuation 
coefficient $\alpha$ using the approximation 

(10) 


The calculated data are compared with optical absorption measurements on powder 
samples. 

The HLRS CRAY XE6 is the main computational resource used for the 
calculations in this work. As can be seen in Fig. 2, the scaling is nearly linear up 
to 200 cores, allowing for highly efficient calculations. 


3 Results 

The structural relaxation of FAP, OFAP and NFAP in gas phase shows that the 
geometry of the aminopyrimidine and pentafluorophenyl rings barely changes 
upon attachment of either a fluorine atom (FAP), a methoxy group (OFAP) or 
an amino group (NFAP). The comparison of our calculated data with x-ray data 
of two polymorphic crystals of the hydrogen analogue 5-phenyl-pyrimidin-2-ylamine 
(HAP) and a HAP-hexafluorobenzene co-crystal [7] as well as the recently 
crystallized NFAP ligand itself shows only small differences in bond lengths and 
angles. Only for the hydrogen bonds do we observe deviations of up to 0.10-0.16 Å 
between measured and calculated data. The geometries calculated here also closely 
agree with Møller-Plesset perturbation theory (MP2) results for APM [28]: The bond 
lengths deviate by less than 0.02 Å and the largest deviation of bond angles amounts 
to 3°. 

Fig. 2 Wall clock time for the DFT-GGA calculation of 5-(pentafluorophenyl)pyrimidin-2-amine 
(FAP) in a 20 × 20 × 20 Å³ cell including 1,152 electronic states on the HLRS CRAY XE6. In 
(a) the behavior of the wall clock time with respect to the number of cores and tasks per node 
(ppn) is shown. As can be seen, the scaling is nearly linear up to 200 cores. The time required is 
reduced with increased distribution of the tasks on several nodes. Additionally, (b) shows that the 
wall clock time can be further reduced if the cores employed for the calculations are equally spaced 
(spacing 32/#ppn) on the nodes 


Table 1 Molecular excitation energies (in eV, see text) 

                 FAP      OFAP     NFAP 
E_g^{PW91}       3.46     3.35     3.00 
E_g^{HSE06}      4.53     4.55     4.21 
E_g^{G0W0}      <7.7     <7.4     <7.1 
E_g^{QP}         7.36     7.06     6.47 
E_ex             3.51     3.46     3.21 
E_ex^J           3.50     3.46     3.22 
E_em             2.08     1.97     1.98 
Δ_S              1.43     1.49     1.23 


Starting from the relaxed structures we calculated the quantities defined in 
Eqs. (1)-(5). The results for FAP, OFAP, and NFAP are compiled in Table 1. We find 
that the difference of the HOMO and LUMO eigenenergies, $E_g^{PW91} = \varepsilon_L - \varepsilon_H$, is 
largest for FAP and decreases by going from OFAP to NFAP (see also Fig. 4 for 
the electronic levels), i.e., with increasing electron-donating properties. In HSE06 
the ordering between FAP and OFAP is reversed compared to the GGA calculation. 
However, the gaps are very close. The trend observed with GGA also holds for the 
G0W0 gaps $E_g^{G_0W_0}$ and the ΔSCF gaps $E_g^{QP}$. The calculation of a QP ΔSCF gap 
requires the determination of the total energies $E(N+1, R)$ and $E(N-1, R)$ of 
charged molecules. Due to the interactions with the periodic images the dependence 
of the latter and thus of the gap $E_g^{QP}$ on the cell size is not negligible. In order to 
correct the calculated excitation energies, the gaps were determined for a cubic cell 
with varying size L = 18 ... 30 Å. 

As shown in Fig. 3, the gap values depend linearly on 1/L. Extrapolation to 
L → ∞ leads to the gaps cited in Table 1. 
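
The extrapolation can be illustrated by an ordinary least-squares fit of the gap against 1/L, whose intercept is the estimate for L → ∞. The sketch below is not the authors' fitting procedure; the function name is hypothetical and the data points are synthetic, generated from an assumed linear law purely to demonstrate the fit.

```python
def extrapolate_gap(cell_sizes, gaps):
    """Fit gap(L) = gap_inf + slope / L by least squares.

    Returns (gap_inf, slope); gap_inf is the value extrapolated to L -> infinity.
    """
    x = [1.0 / size for size in cell_sizes]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(gaps) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, gaps))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic data: cubic cells with L = 18 ... 30 (in Å) and an assumed law
# gap = 7.0 + 6.0 / L (in eV). These numbers are not taken from Fig. 3.
sizes = [18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0]
synthetic_gaps = [7.0 + 6.0 / size for size in sizes]
gap_inf, slope = extrapolate_gap(sizes, synthetic_gaps)
```

Fitting in the variable 1/L rather than L turns the extrapolation into reading off an intercept, which is why a linear dependence on 1/L is convenient in practice.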

size L. The filled/striped symbols for G0W0 values denote calculations with a cutoff for the 
response function of 60/40 eV (see text). The inset shows the respective values for the energy 
difference between the LUMO+1 and HOMO 


A dependence on the unit cell size is also noted for the calculated G0W0 gaps, 
see Fig. 3. To some extent, this is to be expected due to the periodic repetition of the 
molecules. The restriction of the calculations with respect to further parameters due 
to numerical limitations, however, is even more important in the present case. The 
self-energy calculations for cubic cells with the size L = 18-20 Å (22-24 Å) were 
performed with a maximum cutoff for the response function of 60 eV (40 eV), 90 
frequency points, and a cutoff of 15-16 eV for the sum over empty states (including 
up to 1,056 bands). 

The dependence of the G0W0 results on the numerical parameters is obvious from 
the inset in Fig. 3, where the energy difference between the FAP HOMO and LUMO+1 
states is shown, but also from Fig. 4, where the energetic ordering of the electronic 
states is visualized. Obviously, the order changes upon inclusion of electronic 
self-energies calculated with the G0W0 approximation, but is itself not yet converged, 
at least for the unoccupied states. Nevertheless, as will be shown below, the reordering 
due to state-dependent self-energy corrections calculated in G0W0 improves the 
agreement between the measured and calculated optical absorption. The present data 
suggest that the band gaps calculated within the GWA decrease with increasing cell 
size for the molecules studied here. The numbers given in Table 1 should thus be 
considered as approximate upper limits. We find that the values are about 0.5 eV 
larger than the respective energy gaps determined from the ΔSCF calculations. 
The fundamental gaps calculated with the HSE06 scheme, on the other hand, lie 
between the PW91 and the quasiparticle gaps. 

Fig. 4 Energies of molecular orbitals as obtained from DFT (PW91) and G0W0 calculations for 
cubic cells with L = 22 Å (left) and L = 24 Å (right). The influence of the self-energy corrections 
and cell size on the energy order of the states is indicated by different colors. Thick bars refer to 
the orbitals that correspond to HOMO, LUMO, and LUMO+1 in the PW91 calculations. The 
fundamental gap is indicated. Note the different energy region for the empty states 

Interestingly, the quasiparticle shifts are nearly cancelled by electron-hole attraction 
effects: The lowest electron-hole excitation energies $E_{ex}$ are remarkably close 
to the difference of the HOMO and LUMO single-particle eigenenergies obtained 
from DFT. This near cancellation of many-body effects due to the electron-electron 
and the electron-hole interaction suggests that optical excitation spectra calculated 
in the independent-particle approximation may be a reasonable description at least 
for the low-energy excitations. 
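
The size of this cancellation can be read off Table 1 directly. The short Python sketch below does the arithmetic for the three molecules. It assumes the usual definitions, which the text does not spell out: the quasiparticle shift is taken as the ΔSCF gap minus the PW91 gap, and the exciton binding energy as the ΔSCF gap minus the lowest excitation energy. The dictionary layout and names are illustrative.

```python
# Values in eV taken from Table 1: PW91 gap, ΔSCF quasiparticle gap,
# and lowest electron-hole excitation energy E_ex.
table_1 = {
    "FAP":  {"pw91": 3.46, "qp": 7.36, "ex": 3.51},
    "OFAP": {"pw91": 3.35, "qp": 7.06, "ex": 3.46},
    "NFAP": {"pw91": 3.00, "qp": 6.47, "ex": 3.21},
}

cancellation = {}
for molecule, e in table_1.items():
    cancellation[molecule] = {
        # Assumed definition: opening of the gap by self-energy effects.
        "qp_shift": round(e["qp"] - e["pw91"], 2),
        # Assumed definition: lowering of the excitation by electron-hole attraction.
        "binding": round(e["qp"] - e["ex"], 2),
        # What remains after both effects: E_ex relative to the PW91 gap.
        "net": round(e["ex"] - e["pw91"], 2),
    }

for molecule, values in cancellation.items():
    print(molecule, values)
```

Under these definitions both contributions are of the order of 3-4 eV, while their difference stays at a few tenths of an eV or less for all three molecules.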

The calculation of the electron-hole excitation energies is computationally 
robust: The approaches according to Eqs. (2) and (3) result in energies that agree 
within 0.01 eV. The lowest-energy excitations calculated after structural relaxation 
differ appreciably from the respective vertical excitation energies. We calculate 
Stokes shifts between 1.2 and 1.5 eV for the three molecules. Thereby, the energetic 
ordering changes between absorption and emission. While NFAP is predicted to 
have the lowest vertical excitation energy, its deexcitation occurs at slightly larger 
energies than that of OFAP. 

Our calculated values are in reasonable agreement with the experimental data 
available: For FAP dissolved in ethanol Stoll et al. [8] measured a Stokes shift of 
1.28 eV. Given that the optical response of the molecules will be influenced by the 
solvent, these data confirm the validity of the present calculations. 

From the eigenvalues and eigenfunctions obtained in DFT one can directly 
calculate the dielectric function in independent-particle approximation. Figure 6 
shows the resulting spectra for FAP, OFAP, and NFAP. Obviously, in all three cases 
the onset of the optical absorption is larger than $E_g^{DFT} = \varepsilon_L - \varepsilon_H$ due to the small 
transition probability between HOMO and LUMO. There are more similarities in 
the spectra. In particular FAP and OFAP agree largely concerning the positions and 
line shapes of the main peaks I-IV (see Fig. 6). 

Fig. 5 Orbital character of the states HOMO (a)-(c) and LUMO (d)-(f) for FAP, OFAP and NFAP 

Fig. 6 Imaginary part of the dielectric function calculated in independent-particle approximation 
for FAP, OFAP and NFAP as a function of photon energy (eV). A broadening of η = 0.10 eV has 
been used 

Since the dielectric function in independent-particle approximation is composed 
of independent transitions between occupied and empty electronic states, it is 
straightforward to interpret. In particular we find that transitions between HOMO 
and LUMO+1 are essentially causing the first absorption peak for all three 
molecules. The data show furthermore that the optical absorption occurs largely 
due to states localized at the aminopyrimidine and pentafluorophenyl rings. This 
explains why the optical response of the three molecules shown in Fig. 6 is rather 
similar. A notable exception is the first absorption peak of NFAP. In this case 
the HOMO is strongly influenced by amino-group localized states (cf. Fig. 5). 
Contributions of the attached fluorine atom or the methoxy group are - to a much 
smaller extent - also present in the first absorption peak of FAP or OFAP (cf. Fig. 5). 

Fig. 7 Imaginary part of the dielectric function as a function of photon energy (eV), calculated 
by solving the BSE based on G0W0 calculations or by applying a respective scissors shift to 
reproduce the ΔSCF gaps for FAP (a), OFAP (b) and NFAP (c). A broadening of η = 0.10 eV has 
been used. The solid (dashed)/dotted curves and bars give the spectra and oscillator strengths 
versus the eigenvalues calculated within BSE (BSE-TDA) on the basis of the scissors-shifted 
PW91/G0W0 electronic structure. The eigenvalues contributing to the first peak are labeled. See 
Ref. [23] for details. The strongest absorption maximum of FAP dissolved in ethanol [8] is shown 
by a dotted line 

In Fig. 7 the molecular dielectric functions calculated by taking many-body 
effects into account are shown. The calculations have been performed using the full 
excitonic Hamiltonian as well as applying the TDA. The empty electronic levels 
were either shifted such that the respective molecular ΔSCF gaps are reproduced 
or the G0W0 electronic structure was used as input. The redshift of the first peak in 
the NFAP spectra compared to FAP and OFAP as observed in IPA also occurs on 
the BSE level of theory. It is even enhanced by the smaller respective value of the 
ΔSCF gap. In general the positions of the first optical absorption maxima calculated 
within the BSE agree within 1 eV with the IPA calculation, which is indicative of 
some cancellation of quasiparticle and excitonic shifts as already concluded from 
the values in Table 1. 

Similar to the IPA spectra discussed above we perform a systematic analysis of 
the states contributing to the respective absorption peaks (for details see Ref. [29]). It 
turns out that - as already found on the IPA level of theory - HOMO and LUMO+1 
are the states that mostly contribute to the first absorption peak. 

Comparing the spectra obtained from the full Hamiltonian and in TDA one finds 
distinct differences: (i) a redshift of the eigenvalues going from TDA to the full 
Hamiltonian and (ii) strong modifications of the line shape for energies above 4.5 eV 
(FAP, OFAP) or 4.0 eV (NFAP). While the Tamm-Dancoff approximation clearly 
affects the calculated optical absorption, in particular for excitations beyond the 
lowest absorption peak, we find the influence of the electronic structure that is used 
as input for the BSE calculations to be even more important. The optical spectrum 
based on the G0W0 electronic structure differs appreciably from the one based on 
scissors-shifted PW91 eigenvalues. This is due to the state-dependent self-energy 
corrections leading to an energetic reordering of the eigenvalue spectrum that results 
in a significant blueshift of the optical absorption data. 

The measured position of the optical absorption peak of FAP dissolved in 
ethanol [8] in the energy window 2.3-5.7 eV is at 4.72 eV (vertical line in Fig. 7). 
Clearly, the BSE spectrum based on the G0W0 electronic structure agrees best with 
this value. It yields an optical absorption peak at 4.48 eV. From Table 1 it is clear 
that the error bar of the calculated excitation energies is of the order of several 
tenths of an eV. Moreover, our choice for the static dielectric constant used in the 
molecule calculations is bound to result in excitation energies that approach the real 
values from below. An additional uncertainty in the experiment-theory comparison 
is related to the fact that the solvent molecules are not included in the present 
gas-phase calculations. Therefore the deviation between measured and calculated data 
of less than 0.3 eV is not surprising. 

Comparing the computational results for the electronic states of FAP, OFAP, and 
NFAP summarized in Fig. 4 and the optical response from Figs. 6 and 7 one finds 
that the former are far more sensitive to the attachment of functional groups than the 
latter. Since the optical absorption essentially takes place at the aminopyrimidine 
and pentafluorophenyl rings, modifications in the molecular wave functions due to 
the methoxy or amino group are only partially reflected in the optical data. 


4 Summary 

In the present work the electronic structure and optical response of 
2-aminopyrimidines are analyzed on the basis of DFT as well as many-body 
perturbation theory calculations. The calculations predict quasiparticle gaps, i.e., 
differences between the ionization energies and electron affinities, of about 7 eV 
for the molecules. The energies of the lowest optical excitations of the respective 
molecules are considerably lower. In fact, our results indicate a near cancellation 
of the electronic self-energy and exciton binding energies for the lowest excitations 
of 2-aminopyrimidines. In addition to electron-hole attraction effects, we observe 
a very strong influence of local fields, i.e., the unscreened electron-hole exchange, 
on the optical absorption spectra. Moreover, the resonant-nonresonant coupling 
terms in the excitonic Hamiltonian, usually neglected in calculations for inorganic 
semiconductors, are found to noticeably modify peak positions and oscillator 
strengths in the case of the systems studied here. 


Spinodal Decomposition Kinetics 
of Colloid-Polymer Mixtures Including 
Hydrodynamic Interactions 


Alexander Winkler, Peter Virnau, and Kurt Binder 


Abstract The phase separation dynamics of a model colloid-polymer mixture is 
studied by taking explicitly the hydrodynamic interactions caused by the solvent 
into account. Based on the studies on equilibrium phase behavior we perform a 
volume quench from the homogeneous region of the phase diagram deep into the 
region where colloid-rich and polymer-rich phases coexist. We demonstrate that 
the Multiparticle Collision Dynamics (MPCD) algorithm is well suited to study 
spinodal decomposition and present first results on the domain growth behavior of 
colloid-polymer mixtures in quasi two-dimensional confinement. On the one hand, 
we find that the boundary condition of the solvent with respect to the repulsive 
walls strongly influences the phase separation dynamics; on the other hand, we 
show that the wetting behavior of the system leads to changes in the demixing 
pattern morphology over time and hence affects the domain growth laws. 


1 Project Description 

This project deals with computer simulations of suspensions of colloidal particles 
in confinement under non-equilibrium conditions. Colloidal suspensions are a 
prototype system of soft matter, and the softness of these systems makes them ideally 
suited for studies under far-from-equilibrium conditions. 

The term colloid in general refers to systems that are structured on the nano- 
to micrometer scale. Food, detergents, paints and many biological substances are 
colloids. In the context of soft matter physics, however, the term is usually restricted 
to suspensions of nano- to micrometer sized particles or droplets dispersed in a 
liquid. In addition to their obvious use in technological applications, colloids are 
interesting model systems for research on statistical mechanics. The interactions 
between colloidal particles can be tuned, e.g., by addition of salt to suspensions of 
charged particles or by grafting polymers onto their surfaces, and a large number 
of experimental techniques are available to study their properties (for overviews see 
[1-5]). 

In binary fluid mixtures, the growth of the linear dimension $\ell(t)$ with time $t$ after 
the system was quenched from the one-phase region into the two-phase region can 
be described by simple power laws $\ell(t) \propto t^{\alpha}$. The exponent $\alpha$ not only depends 
on the dimensionality of the system, but also on whether or not hydrodynamic effects 
are considered. In two dimensions, droplet coagulation yields $\ell(t) \propto t^{1/2}$ [6], while 
the Lifshitz-Slyozov mechanism gives $\ell(t) \propto t^{1/3}$ [7] like in three dimensions (see 
[8] for a recent study of a pure two-dimensional system). For compositions where 
domain structures are bicontinuous, hydrodynamic mechanisms yield $\ell(t) \propto t$ in 
three dimensions and $\ell(t) \propto t^{2/3}$ in two dimensions if $\ell(t) \gg \ell_{in} = \eta^2/(\rho\gamma)$ (here $\eta$ is 
the shear viscosity, $\rho$ the density and $\gamma$ the interfacial tension between A and B-rich 
domains) [9]. However, experiments and simulations often find slow transients 
before asymptotic power laws occur [10,11], and for two-dimensional systems it 
is still controversial whether a scaling description in terms of universal power laws 
holds at all. In quasi two-dimensional systems the situation becomes even more 
complex because the wetting behavior of the system leads to additional dynamics 
perpendicular to the walls which affect the coarsening pattern morphology. There 
exist reports where the Lifshitz-Slyozov ($\alpha = 1/3$) power law for droplet patterns 
was exceeded in quasi two dimensions, which was interpreted as a result of a wetting 
layer backflow [12,13]. 
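
In practice, an effective growth exponent is often estimated from the slope of the domain size versus time on a double-logarithmic scale. The following Python sketch shows such a fit; the function name is hypothetical and the data are synthetic, generated from an assumed power law with exponent 1/3. Given the slow transients mentioned above, a fit of this kind to real data only yields an effective exponent for the chosen time window.

```python
import math


def growth_exponent(times, lengths):
    """Least-squares slope of log(length) versus log(time)."""
    x = [math.log(t) for t in times]
    y = [math.log(length) for length in lengths]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Synthetic domain sizes following an assumed law l(t) = 2 t^(1/3);
# these are not simulation results.
times = [10.0 * 2 ** k for k in range(8)]
lengths = [2.0 * t ** (1.0 / 3.0) for t in times]
alpha = growth_exponent(times, lengths)
```

Restricting `times` to different windows and comparing the resulting slopes is a simple way to check whether an asymptotic regime has been reached.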

In this project we shall study the dynamics of a model colloid-polymer mixture 
between two walls. Polymers are described as soft spheres weakly interacting 
with each other, while colloid-polymer and colloid-colloid pairs interact with 
the (repulsive) Weeks-Chandler-Andersen potential, so that a depletion attraction 
between colloids results, similar to the Asakura-Oosawa model. This model was 
developed and tested by Zausch et al. [14]. In this context, the phase behavior of 
this model was determined in the bulk. 
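
For orientation, the textbook form of the Weeks-Chandler-Andersen potential is the Lennard-Jones potential truncated at its minimum and shifted to zero there, which leaves a purely repulsive interaction. The Python sketch below implements this standard form; the parameter names and default values are illustrative, and the model of Ref. [14] may use different prefactors or an additional smoothing of the cutoff.

```python
def wca_potential(r, epsilon=1.0, sigma=1.0):
    """Standard WCA pair potential at distance r (assumed textbook form).

    Lennard-Jones potential cut at r_c = 2^(1/6) * sigma and shifted by
    epsilon so that it vanishes continuously at the cutoff.
    """
    cutoff = 2.0 ** (1.0 / 6.0) * sigma
    if r >= cutoff:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6) + epsilon


# Energies (in units of epsilon) on a small grid of distances.
distances = [0.90, 0.95, 1.00, 1.05, 1.10, 1.20]
energies = [wca_potential(r) for r in distances]
```

Because the potential and its first derivative both vanish at the cutoff, no attractive well remains; any effective attraction between colloids then has to arise from the polymers, as in the depletion picture described above.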


1.1 Description of Methods and Algorithms 

In mesoscale simulations a simple continuum description based on the Navier-Stokes 
equations is sometimes not sufficient. A fully atomistic approach on the other hand 
may also not be well suited since only the short time behavior is accessible with 
present-day computers due to the high number of microscopic degrees of freedom. 

The Multi Particle Collision Dynamics (MPCD) algorithm [15-17] is easy to 
implement and a comparatively fast approach which takes advantage of explicit 
ideal solvent particles for which analytic expressions, e.g. for transport coefficients, 
exist. The basic idea is to extend the ordinary velocity Verlet based Molecular 
Dynamics (MD) simulations by a stochastic collision step followed by a streaming 
step of the solvent. The momentum is locally conserved by the division of the system 
into cells which contain the ideal, point-like solvent particles with mass $m$ and 
velocity $\mathbf{v}_i$. The collision step consists of an independent random rotation of the 
relative velocities $\mathbf{v}_i - \mathbf{u}$ of the solvent particles in each cell, where $\mathbf{u}$ is the average 
velocity in the cell. Then, a streaming step is performed where the solvent particles 
propagate the distance $\mathbf{x} = \mathbf{v}_i \tau$, $\tau$ being the discretization time step of the solvent. 

Fig. 1 Phase diagrams of the continuous AO model. (a) Colloid coexistence densities as a 
function of polymer reservoir packing fraction $\eta_p^r = (\pi d_p^3/6) \exp(\beta\mu_p)$ (here, $d_p$ is the 
polymer diameter, $\mu_p$ is the chemical potential of the polymers and $\beta$ is the inverse 
temperature) for wall distances D = 5 and D = 10. (b) The same colloid coexistence densities as a 
function of polymer packing fraction $\eta_p$ for wall distance D = 5. In both plots the crosses 
represent the bulk phase diagram data for comparison 

There are different techniques to couple the embedded fluid particles (like 
colloids and polymers) to the solvent. The simplest way is to exchange momentum 
between solvent particles and fluid particles by including the fluid particles, in 
addition to the solvent particles, in the cells during the collision step [18]. 
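
The collision and streaming steps described above can be sketched for the solvent alone as follows. This is a deliberately reduced illustration under several assumptions that are not taken from the text: two dimensions, equal-mass particles, a rotation by a fixed angle with random sign per cell (one common stochastic-rotation variant), periodic boundaries instead of walls, no random grid shift, and no embedded colloids or polymers. All names are hypothetical.

```python
import math
import random


def mpcd_step(pos, vel, box, cell, tau, alpha, rng):
    """One collision + streaming step for equal-mass solvent particles in 2D.

    pos, vel: lists of (x, y) tuples, updated in place.
    box, cell: periodic box length and collision-cell size (assumed setup).
    tau: solvent time step; alpha: rotation angle; rng: random.Random instance.
    """
    # Sort particles into collision cells.
    cells = {}
    for i, (x, y) in enumerate(pos):
        cells.setdefault((int(x // cell), int(y // cell)), []).append(i)

    # Collision: rotate velocities relative to the cell average u by +/- alpha.
    for members in cells.values():
        n = len(members)
        ux = sum(vel[i][0] for i in members) / n
        uy = sum(vel[i][1] for i in members) / n
        angle = alpha if rng.random() < 0.5 else -alpha
        c, s = math.cos(angle), math.sin(angle)
        for i in members:
            dx, dy = vel[i][0] - ux, vel[i][1] - uy
            vel[i] = (ux + c * dx - s * dy, uy + s * dx + c * dy)

    # Streaming: free flight over tau with periodic wrapping.
    for i, (x, y) in enumerate(pos):
        pos[i] = ((x + vel[i][0] * tau) % box, (y + vel[i][1] * tau) % box)


rng = random.Random(0)
positions = [(rng.uniform(0.0, 4.0), rng.uniform(0.0, 4.0)) for _ in range(64)]
velocities = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(64)]
for _ in range(10):
    mpcd_step(positions, velocities, box=4.0, cell=1.0, tau=0.1,
              alpha=math.pi / 2, rng=rng)
```

Since only the velocities relative to the cell average are rotated, the momentum and kinetic energy of each cell are unchanged by the collision, which is the local conservation property the method relies on.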


1.2 Previous Work 

We were able to determine the static properties of the continuous Asakura-Oosawa 
model confined between two planar walls by means of free energy calculations. We 
used cluster-move based grand canonical Monte Carlo simulations [14,19] together 
with successive umbrella sampling [20] and parallel Wang-Landau sampling [21, 
22] to calculate the coexistence densities and interfacial free energies for various 
parameter sets. The resulting phase diagrams are shown in Fig. 1. 


2 Progress Report (Jan-Apr. 2012) 

We were able to implement and test extensively the MPCD algorithm. We used 
the standard halo layer domain decomposition technique to treat the embedded 
particles and the parallelization approach proposed by Sutmann et al. [23] for the 
solvent particles. This level of parallelization allows us to use 1,024 cores (or more) 
to study successfully the phase separation kinetics in the Asakura-Oosawa model 
in huge systems (max. 1,100,000 MD particles and 52,000,000 solvent particles) 
over multiple scales of MD time. We performed a volume quench (see Fig. 1b) 
from the homogeneous phase into the region where the colloid-rich phase and the 
polymer-rich phase coexist and recorded the average domain growth as a function 
of MD time. The investigation of spinodal decomposition kinetics of colloid- 
polymer mixtures, which are intrinsically asymmetric binary mixtures [14], is a 
very challenging problem in terms of performance on parallel supercomputers. The 
domain growth of colloid and polymer phases leads to an asymmetric distribution 
of the computational effort, because the polymer-rich phase typically contains 
two to four times more particles than the colloid-rich phase. This fact leads to a 
strong slowdown of the simulation with respect to the Molecular Dynamics force 
calculation when the system demixes over time. The treatment of solvent particles 
in the framework of MPCD algorithms hides this imbalance between the various 
processes, so that we achieve almost the same performance as in simple Molecular 
Dynamics simulations even though 52,000,000 solvent particles are present (see 
Fig. 2). The simulations corresponding to the NOHI (no hydrodynamic interactions) 
performance curve are very closely related to standard MD simulations, since 
no explicit solvent particles are present and the overall performance is set by 
the velocity Verlet integration step. We conclude that the problem of spinodal 
decomposition is very well suited for MPCD algorithms, since the imbalance of 
computational effort in the velocity Verlet integration of fluid particles gets hidden 
by the homogeneous computational effort which is needed to handle the solvent. 
For huge systems at the late stages of phase separation the net cost of explicit
solvent particles in the framework of MPCD is only about 10% in
comparison to standard Molecular Dynamics simulations.

For the first wall distance under consideration, D = 5, which is rather small in 
comparison to the lateral system dimensions of L_x = L_y = 256 (unit length scale
is a colloid diameter), the system is expected to exhibit spinodal decomposition 
kinetics with a two-dimensional character. Via the MPCD algorithm it was possible 
to change the solvent properties like wall boundary conditions and hydrodynamic 
coupling without influencing the static properties of the model. From the results 
for wall distance D = 5 we are able to conclude that without hydrodynamic 
interactions (NOHI) the phase separation kinetics is distinctively slower than with 
included hydrodynamic interactions (MBST Stick, MBST Slip). Moreover, we 
discover a strong influence of the confinement due to the boundary conditions of 
solvent particles. When the solvent interacts via perfect slip (specular reflection
at the walls) we find that the domain growth is accelerated in comparison
to stick boundary conditions (bounce-back reflection rules). When hydrodynamic
interactions are switched off, the slowest domain growth is observed (see
the snapshot series in Fig. 3). The development of the average domain size l(t) over
time is shown in Fig. 4. The domain growth law changes from a 1/3 power law
to a 2/3 power law when going from perfect stick boundary conditions to
perfect slip boundary conditions [24]. We believe that this change in the phase
separation kinetics can be explained by hydrodynamic screening caused by the walls.

Fig. 2 Performance of the MPCD simulations on the supercomputer Hermit for system size 256 x
256 x 10 for two different thermostat types as a function of MD time. The MBST (Maxwell
Boltzmann Stochastics Thermostat) curve corresponds to fully included hydrodynamics by explicit
solvent particles, while in the NOHI (no hydrodynamic interactions) performance curve no explicit
solvent particles are present. Arrows indicate a change in the number of used cores. The overall
performance for 1,024 and 4,096 cores was divided by 4 and 16 respectively. The sharp minima in
the MBST data correspond to checkpoints where a full snapshot is written to the hard disk

We discover further a non-monotonic behavior of the dynamics perpendicular to 
the walls which influences the morphology of the observed demixing patterns (see 
the transition from percolated structures to droplets in the case of slip boundaries 
in Fig. 3). These dynamics perpendicular to the walls result from the fact that the 
colloids try to form a wetting layer at the walls while the polymers do not favor the 
walls. This becomes even more pronounced in the case of wall distance D = 10. 
The morphology of the coarsening pattern plays a crucial role with respect to the
growth law dynamics. Only in the case of percolating structures are the long-range
hydrodynamic interactions expected to cause a 2/3 power law at late times. In
contrast, coarsening via droplets is predicted to follow a power law
with exponent 1/3.
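
Growth exponents such as the 1/3 and 2/3 laws discussed above are usually read off from a straight-line fit in a double-logarithmic representation. The sketch below illustrates this under the assumption that the domain size follows l(t) = A t^alpha inside a chosen time window; the function name `growth_exponent` and its arguments are hypothetical.

```python
import numpy as np

def growth_exponent(t, ell, t_min=None, t_max=None):
    """Fit ell = A * t**alpha by linear regression in log-log space.

    Returns (alpha, A). Only points with t_min <= t <= t_max are used,
    so that early transients and finite-size effects can be excluded.
    """
    t = np.asarray(t, dtype=float)
    ell = np.asarray(ell, dtype=float)
    mask = np.ones(t.shape, dtype=bool)
    if t_min is not None:
        mask &= t >= t_min
    if t_max is not None:
        mask &= t <= t_max
    slope, intercept = np.polyfit(np.log(t[mask]), np.log(ell[mask]), 1)
    return slope, np.exp(intercept)
```

Applied to measured data, the choice of the fit window matters: a crossover between two regimes inside the window yields an effective exponent between 1/3 and 2/3.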

For quasi two-dimensional systems we find that the dynamics perpendicular to
the walls change the morphology of the phase separation patterns over time. By
choosing different ratios between the colloid and polymer concentration (labeled
quench1, quench2 and quench3, respectively) we can influence the morphology and
hence the domain growth laws (Fig. 5). The Euler characteristic (Fig. 5a) represents
a measure of morphology: values larger than zero correspond to a droplet structure,
values below zero correspond to a structure dominated by holes, and values close
to zero represent percolation patterns. The data become meaningful from time
t = 600 on, when the colloid and polymer curves start to behave in an antisymmetric
manner. The data were obtained from single runs and have a statistical error of
typically about 0.005. Part (b) of the same figure shows the corresponding growth
laws of the average domain size; the corresponding decomposition patterns
are shown just beneath in the same figure. Here, the perfect slip boundary condition
was used, since only in the case of unscreened hydrodynamics do we expect to see a
difference in the domain growth with respect to the underlying morphology. One can
clearly see that being closer to a percolation pattern (quench3) leads to the fastest
domain growth, while crossing the percolation line slows down the demixing process
drastically (which might be evidence for the so-called pinning effect [25-27]).
From these first results for D = 10 we conclude that the spinodal decomposition
of asymmetric binary mixtures (with respect to their bulk phase behavior as well
as with respect to the condensation behavior at the walls) in quasi two-dimensional
confinement does not show a universal domain growth behavior; rather,
the dynamic changes in morphology strongly influence the growth mechanism
of the decomposition patterns. Additionally we see that, in contrast to purely two-
dimensional systems, a screening of the hydrodynamic solvent correlations due to the
walls is present and that hydrodynamic acceleration of the coarsening mechanism
can only be present for length scales comparable to the wall separation D in the
case of stick boundary conditions.

Fig. 3 Snapshot series of the demixing process. Only the polymers are shown (black dots). Time
increases from top to bottom. The left column corresponds to hydrodynamics with perfect slip
boundary conditions. The middle column corresponds to hydrodynamics with stick boundary
conditions and the right column corresponds to switched off hydrodynamics. (a) Time t = 12.
(b) Time t = 1,200. (c) Time t = 12,000. (d) Time t = 30,000

Fig. 4 Growth of the average domain size as a function of MD time (1 MD time unit
corresponds to 500 integration steps) for three different solvent properties in a system of size
256 x 256 x 5. Red squares represent the data for switched off hydrodynamic interactions (NOHI),
while the blue diamonds and black full circles show the data obtained with the Maxwell Boltzmann
Stochastics Thermostat (MBST) with perfect slip and perfect stick boundaries respectively
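
The use of the Euler characteristic as a morphology measure can be illustrated on a binary two-dimensional pattern. The sketch below is an assumption about one common way to evaluate it (vertices minus edges plus faces of the set pixels, treated as closed squares); the normalisation used for the data in Fig. 5a is not specified in the text, and the function name `euler_characteristic_2d` is hypothetical.

```python
import numpy as np

def euler_characteristic_2d(img):
    """Euler characteristic V - E + F of a binary image.

    Set pixels are treated as closed unit squares, so diagonally touching
    pixels count as connected. Isolated droplets contribute +1 each,
    holes contribute -1 each.
    """
    a = np.pad(np.asarray(img, dtype=bool), 1)  # empty frame around the pattern
    faces = a.sum()
    # An edge exists if at least one of the two pixels sharing it is set.
    h_edges = (a[:-1, :] | a[1:, :]).sum()
    v_edges = (a[:, :-1] | a[:, 1:]).sum()
    # A vertex exists if at least one of the four pixels touching it is set.
    verts = (a[:-1, :-1] | a[:-1, 1:] | a[1:, :-1] | a[1:, 1:]).sum()
    return int(verts - (h_edges + v_edges) + faces)
```

A pattern of many separate droplets gives a large positive value, a connected phase perforated by many holes gives a negative value, and a percolating, bicontinuous-like pattern lies in between, consistent with the interpretation given above.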

Fig. 5 (a) Euler characteristic for colloid and polymer patterns at various concentration ratios using
an MBS thermostat with slip boundaries. (b) Average domain size l(t) for the three different colloid-
polymer concentrations obtained from a single run as a function of time. As a guide for the eye
power laws with different slopes are included. At the bottom a snapshot series for the various
concentration ratios is given (panels at times t ≈ 1500 and t ≈ 6700). Only the polymers are shown
(black dots)


3 Outlook 


It is planned to complete the studies for wall distance D = 10 with a focus on
quench3 and, as for D = 5, to compare the influence of hydrodynamics on
the phase separation kinetics. Furthermore, the extreme case of an almost purely
two-dimensional system with D = 1.5 (where two particles no longer fit on top of
each other) will be studied to clarify to what extent the dimensionality
influences the spinodal decomposition and whether the domain growth behavior
changes continuously or sharply when the system gains access to the
third dimension. The extreme case of D = 1.5 is additionally useful in the sense
that the morphology is no longer influenced dynamically by competing
dynamics perpendicular to the walls. Finally, we want to address the question
whether wetting layer backflow (which can only exist in quasi two-dimensional
systems) can enhance the phase separation kinetics in the case of droplet patterns.


Multi Relaxation Time Lattice Boltzmann
Simulations of Multiple Component Fluid
Flows in Porous Media


Sebastian Schmieschek, Ariel Narvaez, and Jens Harting 


Abstract The flow of fluid mixtures in complex geometries is of practical interest
in many fields, ranging from oil recovery to freeze-dried food products. Due to its
inherent locality the lattice Boltzmann method allows for straightforward
implementation of complex boundaries and excellently scaling parallel computations.
The widely applied Bhatnagar Gross Krook (BGK) scheme, used to model the
contribution of particle collisions to the velocity field, does however suffer from
limitations in precision that become more prominent with increasing surface to
volume ratio. To increase the accuracy of simulations of mixtures in porous media,
we integrated a so-called multi relaxation time (MRT) collision scheme with a
pseudo-potential method for fluids with multiple components. We describe some
optimisation details of the implementation and present test results verifying the
physical accuracy as well as benchmarks obtained on the XC2 Opteron cluster at
the Scientific Supercomputing Centre Karlsruhe (SSCK).


1 Introduction 

Fluid flows involving multiple components and phases are of interest in many
different fields of application, e.g. oil recovery [1], soil treatment and
decontamination [2], food processing [3] and fuel cell optimisation [4]. This has motivated
ongoing theoretical [5] and experimental [6] as well as numerical [7-12] research.

An important measure to describe the behaviour of fluid flows in porous media
is the permeability. It relates an externally applied mean pressure gradient to the
resulting mean flow rate through the medium.

Numerous numerical models have been utilized to model flow in porous media,
exhibiting strengths in different areas [7-9, 13]. While finite element methods
(FEM) allow for a very efficient calculation of certain systems, the incorporation
of complex boundary conditions may become very tedious. Particulate models
such as dissipative particle dynamics (DPD) or molecular dynamics (MD) ease
the integration of various fluid components but require time averaging to produce
good statistics. Furthermore, microscopic modeling as in MD does not allow one to
reach length- and time-scales relevant for the simulation of porous media. While
lattice Boltzmann methods (LBM) seem to be expensive in a direct comparison of
algorithmic costs to macroscopic (FEM) and other mesoscopic methods (DPD), their
inherent thermostatistics suppresses noise in the results and thus makes the method
very competitive. In addition, the localised nature of the model eases parallelization
as well as the integration of complex boundary geometries.

In the decades since its introduction, the LBM has been subject to several
extensions, e.g. to account for multiphase flows [16-18], thermal transport
phenomena [19,20] or reactive flows [21]. Furthermore, weaknesses of the model were
discovered and accounted for [22,23]. One prominent issue of the widely used
Bhatnagar Gross Krook (BGK) collision scheme is the unphysical dependence of
the position of simple bounce back boundaries on the relaxation time parameter $\tau$.
While in straight channel geometries the error thus introduced is easy to account
for, in complex boundary geometries a significant relaxation time dependent error
is observed [12,23]. Amongst several other improvements, the introduction of a
so-called multi relaxation time (MRT) collision scheme to the LB algorithm makes it
possible to mend this shortcoming.

Since many flow situations in porous media involve multiple fluid phases or
components, a consistent algorithm which takes phase separation or interfacial
tension into account is required. In this contribution we report aspects of the
integration of an MRT collision scheme with our pseudo-potential multiphase
lattice Boltzmann implementation lb3d and present results of ongoing physics and
performance benchmarks towards a more accurate and precise simulation tool for
multiphase flows in porous media. Figure 1 shows example visualisations of our test
cases: on the left hand side a rendering of the model porous medium, a BCC sphere
packing, and on the right hand side three snapshots of the domains of two immiscible
fluids in a ternary mixture for different concentrations of an amphiphilic species.


2 Simulation Method 

Fig. 1 Left: Rendering of the surface area of the model porous medium used in our simulations,
a body centred cubic (BCC) sphere packing. The permeability $\kappa$ of this system is analytically
accessible [14, 15]. Right: Rendering of the dynamics of the spinodal decomposition of two
immiscible components in a ternary mixture. Depicted is the state reached after simulation of
10,000 timesteps with surfactant concentrations of $C_s$ = {0, 18.75, 31.25 %}

The lattice Boltzmann equation

$$\underbrace{f_k(\mathbf{x} + \mathbf{c}_k \Delta t,\, t + \Delta t) - f_k(\mathbf{x}, t)}_{\text{advection}} = \underbrace{\Omega(f_k, f_k^{\mathrm{eq}})}_{\text{collision}}$$

models fluid flows in terms of the single particle distribution function $f$. It is
discretised on a lattice with velocities $\mathbf{c}_k$, $k = 0, \ldots, b$. Our model relies on the so-called
D3Q19 lattice, comprising $b = 18$ velocities in the direction of the face and edge
centers of a three-dimensional cube, as well as an additional zero velocity. The time
development of $f$ is described by an undisturbed linear motion (advection) and
an operator $\Omega$ representing particle interactions. The time step $\Delta t$ and the
lattice constant $\Delta x$ are chosen to equal one in the following. The expression $f_k^{\mathrm{eq}}$
designates the local equilibrium distribution, depending on the conserved properties
density $\rho$ and mean velocity $\mathbf{u}$. Its form can be chosen depending on the problem
to be simulated. In the scope of this work a third order expansion of the local Maxwell
distribution with the lattice speed of sound $c_s = 1/\sqrt{3}$ and lattice weights $w_k$ is
used [24].
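
A form of this third-order expansion that is common in the lattice Boltzmann literature, and that is assumed here rather than taken from the text, is $f_k^{\mathrm{eq}} = w_k \rho \left[ 1 + \frac{\mathbf{c}_k \cdot \mathbf{u}}{c_s^2} + \frac{(\mathbf{c}_k \cdot \mathbf{u})^2}{2 c_s^4} - \frac{u^2}{2 c_s^2} + \frac{(\mathbf{c}_k \cdot \mathbf{u})^3}{6 c_s^6} - \frac{u^2 (\mathbf{c}_k \cdot \mathbf{u})}{2 c_s^4} \right]$. The sketch below builds the D3Q19 velocity set with the standard weights 1/3, 1/18 and 1/36 (also an assumption) and evaluates this expression; the names `C`, `W` and `f_eq` are hypothetical.

```python
import itertools
import numpy as np

# D3Q19: the rest vector, 6 face-centre and 12 edge-centre directions of a cube.
C = np.array([c for c in itertools.product((-1, 0, 1), repeat=3)
              if sum(abs(x) for x in c) <= 2])
_norm2 = (C ** 2).sum(axis=1)
# Standard D3Q19 weights (assumed): rest 1/3, faces 1/18, edges 1/36.
W = np.where(_norm2 == 0, 1 / 3, np.where(_norm2 == 1, 1 / 18, 1 / 36))
CS2 = 1.0 / 3.0  # squared lattice speed of sound

def f_eq(rho, u):
    """Third-order expansion of the Maxwell distribution (assumed form)."""
    u = np.asarray(u, dtype=float)
    cu = C @ u
    uu = u @ u
    return W * rho * (1.0 + cu / CS2 + cu**2 / (2 * CS2**2) - uu / (2 * CS2)
                      + cu**3 / (6 * CS2**3) - uu * cu / (2 * CS2**2))
```

With these weights the zeroth and first velocity moments of `f_eq` reproduce the density and the momentum density, which is the property the collision step relies on.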

For the BGK approach the collision operator

$$\Omega_{\mathrm{BGK}} = -\frac{1}{\tau} \left[ f_k(\mathbf{x}, t) - f_k^{\mathrm{eq}}(\mathbf{x}, t) \right],$$

with a single relaxation time $\tau$ is used. The relaxation time determines the
kinematic viscosity $\nu = c_s^2 (\tau - \frac{1}{2})$. With the choices described above, the equation
of state (EOS) $P = \sum_k f_k c_s^2 = \rho c_s^2 = \rho k_B T$ of an ideal gas with density $\rho$ and
temperature $T$ is recovered, where $k_B$ denotes the Boltzmann constant.

While this formulation readily allows the Navier-Stokes equations to be solved up
to second order accuracy, there exist several limitations to the model. Due to the
single relaxation time scale the Prandtl number, i.e. the ratio of momentum to energy
transport, is fixed. In the low viscosity regime $\tau \to 0.5$ strong numerical instabilities
are observed. Furthermore, the position of a bounce-back (no-slip) boundary is
viscosity dependent.

These issues can be addressed by introducing a different collision operator, which
allows the Prandtl number to be varied and increases the overall numerical stability.
The collision operator of the generalised (MRT) LBM can be written as [25]

$$\Omega_{\mathrm{MRT}} = -M^{-1} S M \left[\, |f(\mathbf{x}, t)\rangle - |f^{\mathrm{eq}}(\mathbf{x}, t)\rangle \,\right].$$

Herein $M$ is an invertible transformation matrix, relating the stochastical moments
of the single particle velocity distribution $f$ to linear combinations of its discrete
components $f_k$. It can be obtained by a Gram Schmidt orthogonalization of a matrix
representation of the stochastical moments. The latter can be related to physical
properties such as density $\rho$, velocity $\mathbf{u}$, momentum $\mathbf{p}$, energy $e$, etc. The collision
process is performed in the space of moments, where $S$ is a diagonal matrix of
the individual relaxation times. In addition to the shear viscosity, the bulk viscosity
$\xi = \frac{2}{3} c_s^2 (\tau_{\mathrm{bulk}} - \frac{1}{2})$ can be varied.
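
The structure of the moment-space collision can be illustrated with a small example. For brevity the sketch below uses the two-dimensional D2Q9 lattice with a commonly used orthogonal moment basis and a second-order equilibrium, not the D3Q19 lattice and third-order equilibrium of this work; all names (`M`, `collide_mrt`, `collide_bgk`) are hypothetical. It shows that the MRT operator reduces to BGK when all entries of $S$ equal $1/\tau$.

```python
import numpy as np

# D2Q9 velocities and weights (a 2D stand-in used only for brevity).
C = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4 / 9] + [1 / 9] * 4 + [1 / 36] * 4)
cx, cy = C[:, 0], C[:, 1]
c2 = cx**2 + cy**2

# Rows of M: density, energy, energy squared, j_x, q_x, j_y, q_y, p_xx, p_xy.
M = np.array([np.ones(9), -4 + 3 * c2, 4 - 10.5 * c2 + 4.5 * c2**2,
              cx, (-5 + 3 * c2) * cx, cy, (-5 + 3 * c2) * cy,
              cx**2 - cy**2, cx * cy], dtype=float)
M_INV = np.linalg.inv(M)

def f_eq(rho, u):
    """Second-order equilibrium for D2Q9 (c_s^2 = 1/3)."""
    cu = C @ u
    return W * rho * (1 + 3 * cu + 4.5 * cu**2 - 1.5 * (u @ u))

def collide_mrt(f, rates):
    """Relax each moment towards equilibrium with its own rate (diagonal of S)."""
    rho = f.sum()
    u = (f @ C) / rho
    m, m_eq = M @ f, M @ f_eq(rho, u)
    return f - M_INV @ (np.asarray(rates, dtype=float) * (m - m_eq))

def collide_bgk(f, tau):
    """Single relaxation time collision for comparison."""
    rho = f.sum()
    u = (f @ C) / rho
    return f - (f - f_eq(rho, u)) / tau
```

Choosing different rates for the non-conserved moments leaves density and momentum untouched, because the corresponding moments already equal their equilibrium values.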

The Shan Chen pseudo-potential method alters the local mean velocity 

Ok* * Pa)/T-a F • T a 

U = -1-, 

£<* (Pa)/r a m a 

common to all species $\alpha$ of density $\rho_\alpha$ and mass $m_\alpha$ by introducing a
phenomenological force

$$\mathbf{F}^{\alpha}(\mathbf{x}, t) = -\psi^{\alpha}(\mathbf{x}, t) \sum_{\bar{\alpha}} g_{\alpha\bar{\alpha}} \sum_{k=0}^{b} \psi^{\bar{\alpha}}(\mathbf{x} + \mathbf{c}_k, t)\, \mathbf{c}_k,$$

which is defined to be proportional to a functional of the fluid densities $\psi^{\alpha} = 1 - e^{-\rho_\alpha}$
and a coupling parameter $g_{\alpha\bar{\alpha}}$ determining the nature and magnitude of interaction
between the components $\alpha$ and $\bar{\alpha}$. This accounts for additional (non-ideal) terms
in the equation of state [16,26]. For attractive self-interactions ($g_{\alpha\alpha} < 0$) of a
magnitude larger than a critical value $g_c$, a phase separation into a vapour and a liquid
phase can be observed. Repulsive interactions between two components ($g_{\alpha\bar{\alpha}} > 0$)
are utilised to model systems of partly miscible or immiscible fluid mixtures. While
the input parameters are determined strictly phenomenologically, this approach has
recently been shown to be equivalent to the explicit adjustment of the free energy of
the system [27].
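
The pseudo-potential force can be sketched on a small periodic lattice. The example below assumes the unweighted neighbour sum over the D3Q19 directions together with $\psi = 1 - e^{-\rho}$ as stated in the text; the array layout and the name `shan_chen_force` are hypothetical and do not reflect the lb3d data structures.

```python
import itertools
import numpy as np

# D3Q19 directions (rest vector included; it does not contribute to the force).
C = np.array([c for c in itertools.product((-1, 0, 1), repeat=3)
              if sum(abs(x) for x in c) <= 2])

def psi(rho):
    """Pseudo-potential functional of the density."""
    return 1.0 - np.exp(-rho)

def shan_chen_force(rho, g):
    """Force on every component at every site of a periodic lattice.

    rho : (n_comp, nx, ny, nz) densities
    g   : (n_comp, n_comp) coupling matrix
    Returns an array of shape (n_comp, 3, nx, ny, nz).
    """
    p = psi(rho)
    n_comp = rho.shape[0]
    # neigh[b, d] = sum_k psi^b(x + c_k) * c_k[d]
    neigh = np.zeros((n_comp, 3) + rho.shape[1:])
    for c in C:
        # Rolling by -c brings the value at x + c to position x.
        shifted = np.roll(p, shift=tuple(int(-x) for x in c), axis=(1, 2, 3))
        for d in range(3):
            neigh[:, d] += c[d] * shifted
    coupled = np.einsum('ab,bd...->ad...', np.asarray(g, dtype=float), neigh)
    return -p[:, None] * coupled
```

For a positive cross coupling the force on one component points away from regions rich in the other, which is what drives the demixing of immiscible fluids.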

Amphiphiles (surfactants) are comprised of a hydrophilic (water-loving) head 
group and a hydrophobic (oil-loving) tail. Amphiphilic behaviour is modeled by a 
dipolar moment $\mathbf{d}$ with orientation $\vartheta$ defined for each lattice site. It relaxes in a
BGK-like process, where the equilibrium moment is dependent on the surrounding 
fluid densities [28]. The introduction of the dipole vector accounts for three
additional Shan Chen type interactions, namely an additional force term

$$-2\, \psi^{\alpha}(\mathbf{x}, t)\, g_{\alpha s} \sum_{k \neq 0} \tilde{\mathbf{d}}(\mathbf{x} + \mathbf{c}_k, t) \cdot \Theta_k\, \psi^{s}(\mathbf{x} + \mathbf{c}_k, t)$$

for the regular fluid components $\alpha$ imposed by the surfactant species $s$. Therein,
the tilde denotes post-collision values and the second rank tensor $\Theta_k = I - 3\, \mathbf{c}_k \mathbf{c}_k / c_k^2$
with the identity operator $I$ weights the dipole's force contribution according to
its orientation relative to the density gradient. The surfactant species is subject to
forcing as well, where the contribution of the regular components $\alpha$ is given by

$$2\, \psi^{s}(\mathbf{x}, t)\, \tilde{\mathbf{d}}(\mathbf{x}, t) \cdot \sum_{\alpha} g_{\alpha s} \sum_{k \neq 0} \Theta_k\, \psi^{\alpha}(\mathbf{x} + \mathbf{c}_k, t)$$


and 



is the force due to self-interaction of the amphiphilic species [29].


3 Implementation 

The lattice Boltzmann implementation lb3d provides functionality to simulate three- 
dimensional simple, binary immiscible and ternary amphiphilic fluid systems using 
the Shan Chen pseudo-potential model for non-ideal fluid interactions. 

The boundary conditions available include periodic boundaries, body forcing,
and bounce-back boundaries as well as Lees-Edwards shearing for simple and
binary fluid mixtures. The software is written in Fortran 90 and parallelized using
MPI. It supports XDR and parallel HDF5 formats for I/O and provides checkpoint
and restart for long-running simulations.

The code has been developed at University College London, University of
Stuttgart and Eindhoven University of Technology. It has been ported to many
supercomputers worldwide, where it has shown excellent scalability. Recently it
has been shown to scale linearly on up to 262,144 cores on the European Blue
Gene/P system Jugene [30]. We were able to confirm very good scaling of our
ternary systems on the Opteron based XC2 as well (Fig. 2, left).

Fig. 2 Left: Speedup measure for a ternary system of 256^3 lattice sites on the XC2 at SSC
Karlsruhe. The solid line represents ideal scaling relative to 16 core performance, the dotted line
represents ideal scaling relative to 128 core performance. We observe very good scaling. Right:
Performance comparison, measured in thousand lattice updates per second (kLUPS) produced on
a single processor core based on the Intel Nehalem microarchitecture, running at a clock rate
of 2.4 GHz. Results for a variation in component count using a BGK or MRT collision scheme,
respectively. The components a and b denote immiscible Newtonian fluids, whereas component s
is an amphiphilic fluid species. With increasingly complex fluid interaction calculations, the
contribution of a more costly collision scheme is rendered less important


In recent years, lb3d has been coupled to a molecular dynamics solver in order
to simulate complex fluids containing particulate components, such as blood [31]
or nano-particle stabilised emulsions and suspensions [32]. Moreover, the code has
been used to study, for example, the self-assembly of cubic mesophases [33],
micromixing [34], flow through porous media [10] and fluid surface interactions [35,36].
A refactored version of the code with limited functionality, which focuses mainly on
multiphase fluid simulations, has recently been released under the
LGPL [37].

In order to improve the accuracy and numerical stability of lb3d with respect
to the simulation of multiphase flows in porous media, an MRT collision model
was recently integrated with the software. While the MRT collision
algorithm is more complex than the BGK collision scheme and can cause significant
performance loss when implemented naively, the increase in calculation cost can
be dramatically reduced. We take advantage of two properties of the system to
minimize the impact of the additional MRT operations on the code performance:

1. The symmetry of the lattice makes it possible to precalculate the sums and
differences of discrete velocities which are linearly dependent, thus saving at least
half of the calculation steps [38].

2. The equilibrium stochastical moments can be expressed as functions of the
conserved properties density and velocity, thus saving the transformation of the
equilibrium distributions [25].

As such, the performance penalty could be reduced below 17 %, which is close to 
the minimal additional cost reported in [25]. Since in multiphase systems the relative 
cost of the collision scheme is further reduced, the use of the MRT scheme has even 
less impact on the performance (Fig. 2, right). 

Fig. 3 Left: Double logarithmic plot of the dynamics of the average domain size for different
surfactant concentrations. After an initial phase, resulting from the chosen homogeneous random
initialisation of the fluid densities, the average domain size follows a power law. Increasing
surfactant concentration slows down the domain growth. When the concentration exceeds
$C_s \approx 35$ %, the critical micelle concentration is reached and a stable microemulsion with a minimal
domain size is formed. Right: Time exponent of the dynamics of the average domain size as a
function of the surfactant concentration, determined by fitting the power-law behaviour depicted on
the left-hand side. At low concentrations, $C_s$ below about 10 %, the impact of added surfactant on the domain
size dynamics is close to negligible. As the concentration approaches the saturation of the interface
the retardation of the decomposition becomes very sensitive to added surfactant


4 Results 

To evaluate the functionality of the newly implemented MRT collision scheme,
simulations of the influence of surfactant concentration on the decomposition
process of two immiscible fluid components, performed earlier with lb3d utilising
the BGK method, are reproduced [33].

While keeping the total density of all fluid components constant at $\rho_{\mathrm{tot}} = 0.8$
(equivalent to constant pressure), the concentration of the surfactant component is
varied between 0 and 37.5 % in steps of 6.25 %. Simulations are carried out on
a domain of 256^3 lattice sites. The coupling parameter between the immiscible
components b (blue) and r (red) is set to $g_{br} = 0.08$. The coupling parameters to
the surfactant species are set to $g_{\alpha s} = -0.006$ and $g_{ss} = -0.003$, respectively. The
fluid components are initialised with a homogeneous density distribution and a random
dipole orientation. The average domain size is extracted by evaluating the Fourier
transform of the field of the order parameter $\phi = \rho_b - \rho_r$, which becomes zero
at the interface of the immiscible components [33].
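
One common way to turn the Fourier transform of the order parameter into a length scale is to take the first moment of the structure factor. The sketch below assumes the definition $L = 2\pi / \langle k \rangle$ with $\langle k \rangle = \sum_{\mathbf{k}} |\mathbf{k}|\, S(\mathbf{k}) / \sum_{\mathbf{k}} S(\mathbf{k})$; the exact definition used in [33] is not given here, and the function name `average_domain_size` is hypothetical.

```python
import numpy as np

def average_domain_size(phi):
    """Length scale 2*pi/<k> from the structure factor of a periodic field.

    phi : order parameter field on a periodic lattice (any dimension),
          e.g. the density difference of the two immiscible components.
    The field must not be constant, otherwise the structure factor vanishes.
    """
    phi = np.asarray(phi, dtype=float)
    phi = phi - phi.mean()                      # remove the k = 0 contribution
    s = np.abs(np.fft.fftn(phi)) ** 2           # structure factor S(k)
    freqs = [2 * np.pi * np.fft.fftfreq(n) for n in phi.shape]
    grids = np.meshgrid(*freqs, indexing='ij')
    k = np.sqrt(sum(gk ** 2 for gk in grids))   # |k| for every Fourier mode
    k_mean = (k * s).sum() / s.sum()
    return 2 * np.pi / k_mean
```

For a pattern of stripes with wavelength 16 lattice sites this measure returns 16, i.e. it reports the repeat distance of the domain pattern.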

As can be observed from Fig. 3, two regimes of the dynamics can be
distinguished. At low to intermediate surfactant concentrations ($C_s < 35$ %) spinodal
decomposition takes place, resulting in completely demixed fluids. Here, after an
initial phase related to the formation of domains from the homogeneously initialised
fluids, the average domain size clearly follows a power law with respect to time.
Figure 3 (left) shows a double-logarithmic plot of the dynamics of the average
domain sizes for varying surfactant concentration. The extracted numerical values
for the power-law relation can be found in Fig. 3 (right). Average domain sizes
exceeding 80 lattice sites have been excluded from the fit, as in this range the
periodic boundary conditions begin to influence the results. In the range of higher
concentration close to the critical micelle concentration ($C_s = 31.25$ %) the initial
phase is extended considerably as compared to lower concentrations. The impact of
adding surfactant on the dynamics of the average domain size is almost an order of
magnitude larger here than in the low concentration range around 10 %. When
the concentration exceeds $C_s \approx 35$ %, the critical micelle concentration is reached
and stable microemulsions with a constant domain size are formed. These results are
in agreement with the findings of the reference study and give confidence in the
MRT collision scheme.

To test the behaviour of the MRT scheme in a complex geometry, the flow of two
miscible phases through the model BCC sphere packing porous medium, illustrated
in Fig. 1 (left), is simulated. The permeability for this model system

$$\kappa = \frac{a^2}{6 \pi \chi h}$$

can be derived analytically in terms of the side length of the unit cell $a$, the
dimensionless drag $h$ and a geometry factor $\chi$ giving the ratio of actual to maximal
sphere radius. We choose the maximum radius here, so $\chi = 1$. The dimensionless
drag can be shown to depend on the system geometry only and can be expressed as
an expansion in $\chi$ [14,15].

In our simulation setup, we measure the permeability using Darcy's law

$$\kappa = -\rho \nu \frac{\langle u \rangle}{\langle \nabla P \rangle},$$

with the fluid density $\rho$, the kinematic viscosity $\nu$, the average velocity $\langle u \rangle$ and
the average pressure gradient $\langle \nabla P \rangle$. For the test runs two miscible fluids of
equal density $\rho_r = \rho_b = 1.0$ and equal kinematic viscosity, varied by the relaxation
parameter $\tau$, are driven through the porous medium by applying a body force of
$F = 0.001$ in lattice units in a layer four lattice sites long, located in the center of a
buffer region of 20 lattice sites width.
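
As a worked illustration of this measurement, the sketch below evaluates Darcy's law for given averages, using the BGK relation $\nu = c_s^2 (\tau - \frac{1}{2})$ with $c_s^2 = 1/3$ for the viscosity. The function names and the numerical values in the usage note are hypothetical and not simulation results.

```python
def kinematic_viscosity(tau, cs2=1.0 / 3.0):
    """Lattice kinematic viscosity for relaxation time tau (lattice units)."""
    return cs2 * (tau - 0.5)

def permeability(rho, nu, u_mean, grad_p_mean):
    """Darcy's law: kappa = -rho * nu * <u> / <grad P>.

    u_mean and grad_p_mean are the components of the averaged velocity and
    pressure gradient along the flow direction; the flow runs down the
    pressure gradient, so their ratio is negative and kappa is positive.
    """
    return -rho * nu * u_mean / grad_p_mean
```

For example, with $\tau = 1$ (so $\nu = 1/6$), $\rho = 1$, an assumed mean velocity of 0.001 and an assumed mean pressure gradient of -0.0001, the function returns a permeability of 10/6 in lattice units.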

Figure 4 shows a comparison of absolute permeability measurements conducted
in simulations with varying relaxation parameter $\tau$ using the MRT and BGK
collision scheme, respectively. In these preliminary simulations the only relaxation
parameter modified in the MRT scheme is the one related to the dynamic viscosity;
all other parameters are kept equal to 1, or to 0 for parameters related to conserved
properties ($\rho$ and $\mathbf{u}$). In this configuration at $\tau = 1$, the MRT scheme is therefore
equivalent to the BGK scheme, as can be observed from the identical outcome
of the permeability measurement at that point. However, already for this parameter
set, the dependence of the system's behaviour on the relaxation parameter is less
pronounced. It has been shown that this dependence can be eliminated by a suitable
choice of parameters [23]. However, previous attempts by other groups were not
combined with the Shan Chen multicomponent method and thus did not allow the
simulation of multicomponent flow in porous media. All relative permeabilities $\kappa_\alpha / \kappa$
measured were found to be exactly 0.5, which is to be expected, since in the runs
presented here all properties of the liquid components were set to be equal.

Fig. 4 Preliminary results of measurements of absolute permeability of the model BCC sphere
packing porous medium for varying relaxation times $\tau$. In this case we use non-optimised values for
the MRT relaxation parameters, all equal to 1. In the case of $\tau = 1.0$, this reduces the MRT to the
BGK scheme, as is illustrated here by identical results. By adjusting the relaxation parameter set,
the dependence of the permeability measurement on the relaxation parameter can be completely
eliminated [23]. Since equal fluid properties were chosen, the relative permeability $\kappa_\alpha / \kappa$ is 0.5 for
all cases

These promising results allow us to proceed with further tests, such as optimising
the MRT parameter set and studying the accuracy of relative permeability
measurements for fluid mixtures with different properties. Possible future applications of
this model include permeability calculations for ternary amphiphilic mixtures
and studies of imbibition and demixing processes in porous media, where our model
can take effects of wettability into account as well.


5 Summary 

To improve the accuracy and efficiency of our lattice Boltzmann
implementation for fluid flows with multiple components in porous media, we
have implemented a multi relaxation time collision scheme. The steps taken to
optimize the performance have limited the increase in calculation time to a value close
to the expected minimum. Furthermore, we could show that for multi-component
systems the additional calculation time spent in the collision step is very small in
comparison to the time increase due to the calculation of the interaction forces.
The reproduction of prior simulations of the dynamics of an amphiphilic fluid
mixture as well as a currently modest but optimisable improvement in accuracy
over a broader range of viscosity values give confidence in the implementation.
Preliminary tests of permeability measurements for multicomponent flows in porous
media are promising and allow us to proceed further towards productive application of
the model to the simulation of ternary amphiphilic mixtures in porous media.


Abstract In the frame of planet formation by coagulation the growth step from
millimetre-sized highly porous dust aggregates to massive kilometre-sized
planetesimals is not well constrained. In this regime of pre-planetesimals, collisional
growth is endangered by disruptive collisions, disintegration by rotation as well
as mutual rebound and compaction. Since laboratory studies of pre-planetesimal
collisions are infeasible beyond centimetre-size, we perform numerical simulations.
For this purpose, we utilise the parallel smoothed particle hydrodynamics (SPH) code
parasph. This program has been developed to simulate macroscopic highly
porous dust aggregates consisting of protoplanetary material. We briefly introduce
our porosity model and use it to perform simulations on the growth criteria of
pre-planetesimals. With the aid of parameter studies we investigate fragmentation
criteria in dust collisions depending on aggregate size and aggregate porosity. We
extend a previous study on bouncing criteria of equally sized aggregates depending
on their porosity and the presence of compacted shells of various porosities.
Regarding the rotational stability of highly porous dust aggregates we theoretically
derive fragmentation criteria for dust cylinders depending on angular velocity as
well as porosity and perform suitable simulations.
1 Introduction 

According to the coagulation scenario planet formation occurs in protoplanetary 
discs by a sequence of successive mutual collisions from micrometre-sized dust 
grains to planets. Three regimes can be distinguished: during the first step from 
grains to millimetre-sized pre-planetesimals growth proceeds in a fractal way [5]. 
The growth mechanism in the second growth step from millimetre-sized 
pre-planetesimals to kilometre-sized planetesimals is not well constrained and is the 
subject of this work. Once a sufficient population of planetesimals exists, the third growth 
step to full-size planets proceeds due to gravity-aided accretion [15]. An overview 
of this process can be found in Ref. [9]. 

We turn to the second growth step from pre-planetesimals to planetesimals, which 
is currently the subject of extensive numerical and experimental effort. Two important 
obstacles have been identified in the pre-planetesimal regime: the fragmentation 
barrier and the bouncing barrier. The most serious is the fragmentation barrier: the relative 
velocities between pre-planetesimals increase as the size of the pre-planetesimals 
increases. The consequence is an increased probability for catastrophic disruption. 
Since experimental data on pre-planetesimal collisions are rather sparse, collision 
maps often have to contain simplistic assumptions such as a constant velocity 
threshold for fragmentation over several orders of magnitude in size [2, 16]. 
Transition velocities between parameter regions of positive and negative growth are 
often treated as independent of object porosities. 

The second obstacle to planetesimal formation is the bouncing barrier. In this 
scenario [16, 28], millimetre to centimetre-sized pre-planetesimals become more 
and more compacted during their collisional evolution. The relative velocities for 
these objects are still rather low. However, due to their compact structure the pre- 
planetesimals rebound instead of sticking to each other or destroying each other 
upon collision. Recent studies indicate that if growth is halted at smaller sizes than 
millimetre size, the bouncing barrier could even be beneficial to planetesimal growth 
if some objects grow out of this barrier and sweep up all the smaller particles [27]. 

The need for a detailed numerical investigation of pre-planetesimal collisions 
arose from the lack of experimental data under realistic protoplanetary disc 
conditions in this field. For this reason we developed a smoothed particle hydrodynamics 
(SPH) code to simulate porous pre-planetesimal material [9, 11] and calibrated 
it with laboratory benchmark experiments [17]. For a detailed classification of 
the collision outcome we developed the four-population model [10]. The numerical 
model and classification scheme were utilised to show that the stability of 
pre-planetesimals is significantly decreased if they possess an inhomogeneous structure 
[12, 13]. Additionally, preliminary studies showed that the conditions for rebound 
are drastically changed if pre-planetesimals feature a compacted outer shell [9, 13]. 
We also began a large-scale parameter study of pre-planetesimal collisions 
in the decimetre regime, which already indicated that the size ratio and porosity of 
the collision partners have a non-negligible influence on the collision outcome 
[9, 13]. 


Simulation of Pre-planetesimal Collisions with SPH II 


In this article we continue this work. Initially, in Sect. 2, we briefly review the 
code and porosity model. Detailed quantitative investigations of size- and porosity- 
dependence of the fragmentation velocity threshold are presented in Sect. 3.1. 
The influence of the compactness of hard shells is studied in Sect. 3.2. Finally, 
in Sect. 3.3, analytical formulae for the stability of rotating dust aggregates are 
presented and counterchecked with simulations. Further areas of research are 
described in the outlook Sect. 4. 


2 SPH Code and Porosity Model 
2.1 Solid Body SPH 

For the simulation of pre-planetesimal collisions we utilise the numerical method 
smoothed particle hydrodynamics (SPH) [14,21] together with extensions for solid 
body mechanics [1,20,22] and a suitable porosity model [11,25]. 

As a method which was originally developed for pure hydrodynamics, SPH relies 
on the usual equations of continuum mechanics to ensure conservation of mass, 
momentum, and energy. Under external forces, stresses develop inside a solid body 
to restore its original shape. Hence, the main properties of solid body mechanics 
enter in the momentum equation via the stress tensor $\sigma_{\alpha\beta}$, which is defined as 

$$\sigma_{\alpha\beta} = -p\,\delta_{\alpha\beta} + S_{\alpha\beta}, \tag{1}$$

where $p$ and $S_{\alpha\beta}$ denote the hydrostatic pressure and deviatoric stress tensor, 
respectively. Throughout this article Greek indices denote the spatial components 
and the Einstein summation convention is applied. The time evolution of the deviatoric 
stress tensor is computed according to Hooke’s law in frame invariant Jaumann rate 
form [1,9,11,13,24]. The transition to plastic deformation due to shear is determined 
by the von Mises criterion 

$$S_{\alpha\beta} \rightarrow f\, S_{\alpha\beta}, \tag{2}$$

where $f = \min\left[Y^2(\phi)/(3 J_2),\, 1\right]$ and $Y = Y(\phi)$ is the shear strength. The quantity 
$J_2 = S_{\alpha\beta} S_{\alpha\beta}$ is the second irreducible invariant of the deviatoric stress tensor. 
Sub-grid porosity is modelled by the filling factor $\phi$, which is defined as 

$$\phi = \frac{\rho}{\rho_s} = 1 - \Phi, \tag{3}$$

where $\rho$, $\rho_s$, and $\Phi$ are the density of the porous material, the density of the matrix 
material, and the porosity, respectively. The pre-planetesimal material is 
monodisperse spherical SiO$_2$ dust [3,4], for which $\rho_s = 2{,}000\,\mathrm{kg\,m^{-3}}$. 
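
The yield limiter of Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration and not the parasph implementation: the function name `limit_deviatoric_stress` and the nested-list representation of the tensor are assumptions made here, and $J_2$ is taken as $S_{\alpha\beta} S_{\alpha\beta}$ exactly as written in the text.

```python
def limit_deviatoric_stress(S, Y):
    """Apply the von Mises limiter S -> f * S of Eq. (2).

    S : 3x3 nested list, deviatoric stress tensor in Pa
    Y : shear strength Y(phi) in Pa
    (Hypothetical helper for illustration; not taken from parasph.)
    """
    # J2 = S_ab S_ab, summed over both indices (Einstein convention)
    J2 = sum(s * s for row in S for s in row)
    if J2 == 0.0:
        return [list(row) for row in S]  # no deviatoric stress to limit
    f = min(Y**2 / (3.0 * J2), 1.0)
    return [[f * s for s in row] for row in S]


# Traceless example stress with J2 = 100^2 + 50^2 + 50^2 = 15000 Pa^2
S_example = [[100.0, 0.0, 0.0],
             [0.0, -50.0, 0.0],
             [0.0, 0.0, -50.0]]

weak = limit_deviatoric_stress(S_example, Y=100.0)     # f < 1: stress is scaled back
strong = limit_deviatoric_stress(S_example, Y=1000.0)  # f = 1: stress is unchanged
print(weak[0][0], strong[0][0])
```

As long as the material is strong enough that $Y^2 \geq 3 J_2$, the tensor passes through unchanged; otherwise all components are reduced by the same factor.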
Fig. 1 Calibrated strength curves. The compressive strength $\Sigma(\phi)$, tensile strength $T(\phi)$, and 
shear strength $Y(\phi)$ as a result of the calibration process described in Ref. [11] 


The elastic as well as the plastic evolution of the hydrostatic part of the stress 
tensor are computed according to the filling-factor-dependent bulk modulus $K(\phi)$ 
as well as the compressive strength $\Sigma(\phi)$ and tensile strength $T(\phi)$ [11-13], 
respectively. The bulk modulus scales with the filling factor by the following relation 

$$K(\phi) = K_0 \left(\frac{\phi}{\phi_{\mathrm{RBD}}}\right)^{\gamma}, \tag{4}$$

where $\gamma = 4$ and $K_0$ is the bulk modulus of an uncompressed random ballistic 
deposition (RBD) dust sample with $\phi_{\mathrm{RBD}} = 0.15$ [3]. This value was calibrated to 
be $K_0 = K(\phi_{\mathrm{RBD}}) = 4.5\,\mathrm{kPa}$ [11]. 

The strength quantities [11,17] are illustrated in Fig. 1 and represent transition 
thresholds from the elastic to the plastic regime. The tensile strength is given by a 
power-law 

$$T(\phi) = -10^{2.8 + 1.48\phi}\,\mathrm{Pa}. \tag{5}$$

The compressive strength takes the form 

$$\Sigma(\phi) = p_m \left(\frac{\phi_{\max} - \phi_{\min}}{\phi_{\max} - \phi} - 1\right)^{\Delta \ln 10}, \tag{6}$$

with $\phi_{\max} = 0.58$ and $\phi_{\min} = 0.12$, which denote the maximum and minimum filling 
factor of the compressive strength relation. The quantity $\Delta \ln 10$ is the power of the 
expression with $\Delta = 0.58$. The constant $p_m = 260\,\mathrm{Pa}$ is its mean pressure. 

The shear strength is given by the geometric mean of Eqs. (5) and (6) 

$$Y(\phi) = \sqrt{\Sigma(\phi)\,|T(\phi)|}. \tag{7}$$
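
For orientation, the relations (4) to (7) can be evaluated with a short Python sketch. The constants are the ones quoted in the text; the function names are chosen here for illustration and do not come from the parasph code, and the compressive strength is only evaluated strictly between the minimum and maximum filling factor, where Eq. (6) is well defined.

```python
import math

PHI_RBD = 0.15                 # filling factor of the uncompressed RBD sample
K0 = 4.5e3                     # bulk modulus at PHI_RBD in Pa
GAMMA = 4.0                    # exponent of the bulk modulus relation
PHI_MIN, PHI_MAX = 0.12, 0.58  # limits of the compressive strength relation
DELTA = 0.58                   # parameter in the exponent Delta * ln(10)
P_M = 260.0                    # mean pressure in Pa


def bulk_modulus(phi):
    """Eq. (4): bulk modulus in Pa."""
    return K0 * (phi / PHI_RBD) ** GAMMA


def tensile_strength(phi):
    """Eq. (5): tensile strength in Pa (negative by convention)."""
    return -(10.0 ** (2.8 + 1.48 * phi))


def compressive_strength(phi):
    """Eq. (6): compressive strength in Pa, for PHI_MIN < phi < PHI_MAX."""
    if not PHI_MIN < phi < PHI_MAX:
        raise ValueError("phi outside the range of the compressive strength relation")
    base = (PHI_MAX - PHI_MIN) / (PHI_MAX - phi) - 1.0
    return P_M * base ** (DELTA * math.log(10.0))


def shear_strength(phi):
    """Eq. (7): geometric mean of compressive and tensile strength in Pa."""
    return math.sqrt(compressive_strength(phi) * abs(tensile_strength(phi)))


for phi in (0.15, 0.35, 0.55):
    print(f"phi = {phi:.2f}: K = {bulk_modulus(phi):.3g} Pa, "
          f"T = {tensile_strength(phi):.3g} Pa, "
          f"Sigma = {compressive_strength(phi):.3g} Pa, "
          f"Y = {shear_strength(phi):.3g} Pa")
```

At $\phi = 0.35$ the bracket in Eq. (6) equals one, so the compressive strength reduces to $p_m$.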
2.2 The Code Parasph 

The parasph code was developed by Hipp [19] and extended by Schäfer for 
the simulation of ductile, brittle, and porous media [23,24]. The porosity model 
was improved and calibrated by Geretshauser for pre-planetesimal material [9, 11] 
together with other extensions. The program is based on the parasph library [6,7]. 
This is a set of routines developed for easier and faster handling of parallel 
particle codes. By means of this library the physical problem and the parallel 
implementation are clearly separated. parasph features domain decomposition, 
load balancing, nearest neighbour search, and inter-node communication. Moreover, 
SPH enhancements such as additional artificial stress and XSPH were implemented. 
The adaptive Runge-Kutta-Cash-Karp integrator has been used for the simulations 
presented here. For testing purposes an adaptive second order Runge-Kutta and an 
Euler integrator were added. HDF5 (Hierarchical Data Format) was included as a 
compressed input and output file format with increased accuracy, which decreases 
the amount of required storage space considerably. The parallel implementation 
utilises the Message Passing Interface (MPI) library. Test simulations yielded a 
speedup of 120 on 256 single core processors of a Cray T3E and of 60 on 128 
single core processors of a Beowulf cluster. 
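
The quoted speedups can be turned into parallel efficiencies with a one-line calculation. The sketch below assumes the usual definition, efficiency = speedup / number of processors, which the text does not state explicitly; the function name is chosen here for illustration.

```python
def parallel_efficiency(speedup, n_procs):
    """Parallel efficiency under the assumed definition speedup / n_procs."""
    return speedup / n_procs


# (machine, speedup, number of single core processors) as quoted in the text
benchmarks = [("Cray T3E", 120, 256), ("Beowulf cluster", 60, 128)]

for machine, speedup, n_procs in benchmarks:
    eff = parallel_efficiency(speedup, n_procs)
    print(f"{machine}: efficiency = {eff:.3f}")
```

Under this definition both test cases correspond to the same efficiency of slightly below one half.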


3 Results 

The simulations in this section are carried out with 240,143 to 476,476 SPH particles 
depending on the size of the projectile. The program parasph is run on the NEC 
Nehalem cluster of the HLRS. Depending on the size of the problem, 32-80 cores 
were used. The simulation time is strongly dependent on the collision velocity. In 
particular for fragmenting collisions the adaptive time step becomes as small as 
$\sim 10^{-5}$ s throughout the whole simulation. In contrast, in bouncing and sticking 
collisions the time step is initially small and then increases for the rest of the 
simulation. To give a rough number, simulations take 72-240 h wall-clock time 
for 1 s of simulated time depending on the size of the problem and the physical 
processes involved. 


3.1 Velocity Thresholds 

The main focus of the project lies on the investigation of the growth conditions for 
macroscopic pre-planetesimals. In this section we carry out simulations to constrain 
the influence of projectile size and aggregate porosity on these conditions for 
centimetre to decimetre-sized pre-planetesimals. Although the velocity threshold 
Fig. 2 Fragmentation threshold velocity against projectile size. The different colours represent 
the filling factors $\phi = 0.15$ (blue), $\phi = 0.25$ (red), and $\phi = 0.35$ (black). The arrows denote 
upper and lower limits. The target is a sphere of radius 10 cm. Due to the higher collision energy 
the threshold velocity for fragmentation decreases with increasing projectile size. Highly porous 
aggregates ($\phi = 0.15$) are less stable than aggregates of intermediate porosity ($\phi = 0.35$) 


for fragmentation might be significantly lowered if the pre-planetesimals possess an 
inhomogeneous filling factor distribution [10], we restrict ourselves to initially 
homogeneous aggregates as a simplification and reference case. 

Figure 2 illustrates the dependence of the fragmentation threshold velocity on the 
size of the impacting projectile. In each case the target is a porous dust sphere of 
radius 10 cm. The projectile is also spherical. Both objects are homogeneous 
and of identical filling factor. We investigate three filling factors: $\phi = 0.15$, 
$\phi = 0.25$, and $\phi = 0.35$. This represents a range between high and intermediate porosity. 
All three curves have a similar shape. Very small projectiles intrude into the target 
and get swallowed. Only a few dust fragments are ejected. With increasing projectile 
size the impact energy increases and hence the threshold for disruption decreases. 
Compared to the immense range of sizes in pre-planetesimal collisions, 
ranging from millimetre to kilometre sizes, the difference in threshold fragmentation 
velocity within the quite narrow investigated size range is remarkable. We conclude 
that even small differences in size have to be considered in the field of 
pre-planetesimal growth. Figure 2 already points to another important dependence: 
the aggregate porosity. 
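
To illustrate how strongly the impact energy grows with projectile size at fixed velocity, the following sketch computes the kinetic energy of the relative motion for homogeneous spheres. Taking the impact energy as $\frac{1}{2}\mu v^2$ with the reduced mass $\mu$ is an assumption made here for illustration (the text does not define the impact energy), as are the function names and the example velocity.

```python
import math

RHO_S = 2000.0  # matrix density in kg m^-3 (Sect. 2.1)


def aggregate_mass(radius, phi):
    """Mass in kg of a homogeneous sphere with filling factor phi."""
    return phi * RHO_S * 4.0 / 3.0 * math.pi * radius**3


def impact_energy(r_target, r_projectile, phi, v_rel):
    """Kinetic energy of the relative motion in J (assumed definition)."""
    m_t = aggregate_mass(r_target, phi)
    m_p = aggregate_mass(r_projectile, phi)
    mu = m_t * m_p / (m_t + m_p)  # reduced mass
    return 0.5 * mu * v_rel**2


# 10 cm target, phi = 0.15, illustrative relative velocity of 1 m/s
for r_p in (0.02, 0.04, 0.06, 0.08, 0.10):
    e = impact_energy(0.10, r_p, phi=0.15, v_rel=1.0)
    print(f"projectile radius {100 * r_p:4.0f} cm: impact energy {e:.4f} J")
```

Because the projectile mass scales with the cube of its radius, a modest change in projectile size changes the available energy considerably, which is consistent with the trend described above.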

Figure 3 shows a surprising dependence of the threshold velocity on the 
aggregate porosity. As a standard setup for this study we collide two homogeneous 
Fig. 3 Threshold velocity for fragmentation against filling factor. The target and projectile are 
spheres of radius 10 cm and 6 cm, respectively. The arrows denote upper and lower limits. Starting 
from low filling factors the threshold velocity increases because the strength of the aggregate 
increases. At a filling factor of $\phi \approx 0.37$ a sharp drop occurs 


spherical aggregates with vanishing impact parameter. The target and projectile 
radii are 10 cm and 6 cm, respectively. At low filling factors or, equivalently, high 
porosity, the tensile and shear strength are low (see Eqs. (5), (7), and Fig. 1). As a 
consequence, such pre-planetesimals are rather fragile. However, the compressive 
strength is also rather low (Eq. (6) and Fig. 1) and hence energy can easily be 
dissipated by plastic deformation since the pressure threshold for plastic compression is 
exceeded in each collision. 

As the filling factor increases the compressive strength, shear strength, and 
tensile strength increase. On the one hand the plastic deformability of the aggregate 
and hence the ability to dissipate energy hereby decrease. On the other hand, 
however, the aggregates become less fragile. Overall the aggregates gain a higher 
stability and the threshold velocity further increases for higher filling factors. 

At a filling factor of $\phi \approx 0.35$ a sudden drop in the threshold velocity occurs. 
This drop is surprising and is caused by a complex interplay between the elastic and 
plastic properties of the aggregates, which makes it difficult to capture this feature 
quantitatively. This issue is still the object of ongoing research, but the influence of 
the following aspects could be identified. (1) With increasing compressive strength 
$\Sigma(\phi)$ an increasing fraction of the initial kinetic energy can be stored in elastic 
loading of the aggregates (see also Sect. 3.2). As a consequence, this elastic energy 
is available for the separation of the aggregates in a bouncing collision. (2) With 
increasing $\phi$ the bulk modulus of the aggregates also increases (Eq. (4)) and the 
aggregates become stiffer. As a result, the contact area between the aggregates 
decreases and less energy is necessary to separate the aggregates in a bouncing 
collision. (3) Because of partial sticking of the aggregates, a clean growth-neutral 
bouncing is unrealistic. Instead some mass transfer or even the rip-out of larger 
chunks can be expected. (4) Due to the increase in $\Sigma(\phi)$ and $K(\phi)$, density waves of 
increasing amplitude propagate across the aggregate and lead to a local rarefaction 
of the material and a consequential local reduction of shear and tensile strength 
(see also a similar phenomenon in Sect. 3.3). These density waves finally rip the 
aggregate apart even at low collision velocities. 

Aspects (1) and (2) decrease the probability of sticking of the projectile 
to the target and increase the probability of bouncing. Aspect (4) 
increases the probability of fragmentation of the aggregate. The collisions at the 
drop of the threshold velocity in Fig. 3 show a behaviour where the aggregates 
mainly bounce but partly also fragment, or some mass transfer (aspect (3)) occurs. 
Since sticking with consequential growth and bouncing (with some fragmentation 
or mass transfer), resulting in neutral or negative growth, are two clearly distinct 
physical processes, a sharp drop in the describing threshold velocity can be expected, 
which at least qualitatively explains the shape of the curve in Fig. 3. We note that a 
collision model based on experimental data [16], which features only the categories 
“porous” and “compact” aggregates, uses a filling factor of $\phi = 0.40$ as separation 
criterion, which is close to the drop filling factor of $\phi = 0.37$. Other experimental 
investigations [26] find that the maximum filling factor that can be reached for 
porous aggregates in protoplanetary discs is roughly $\phi \approx 0.33$. However, this value 
is obtained for polydisperse SiO$_2$ dust, which is less compressible. This value can 
be expected to be higher for our monodisperse dust. To summarise, the drop in our 
threshold velocity curve occurs where experimental data also indicate a significant 
change in the collision behaviour of dust aggregates. However, the reasons for this 
drop require a closer investigation. 


3.2 Bouncing, Hard Shells and Porosity 

In our first study we investigate the influence of porosity on the bouncing and 
sticking behaviour. This is to assess whether the experiments with intermediate 
porosity [2] and high porosity [18] can be combined into the collision map presented 
by Güttler et al. [16]. 

For this, we conduct simulations of collisions between a homogeneous target 
and projectile. We distinguish three cases where both objects feature a uniform 
filling factor of $\phi = 0.15$, $\phi = 0.35$, and $\phi = 0.55$. According to the compressive 
strength relation this is nearly equivalent to maximum, intermediate, and minimum 
porosity. The aggregates thus feature very low, intermediate, and very high pressure 
thresholds for plastic deformation. Resulting from this property we expect a 
strong dependence of the sticking and bouncing behaviour on the filling factor. The 
Fig. 4 Cross-section through the outcome of collision simulations with homogeneous aggregates 
for high ($\phi_i = 0.15$, left), intermediate ($\phi_i = 0.35$, middle) and low porosity ($\phi_i = 0.55$, right). 
The projectile radius is equal to the target radius. The collision velocities are in the first line (a, e, i) 
0.1, in the second line (b, f, j) 0.3, in the third line (c, g, k) 0.5, and in the fourth line (d, h, l) 
1.0 m s$^{-1}$. The highly porous dust aggregates stick for each of the simulated collision velocities. 
The filling factor is increased in the vicinity of the impact site. Additionally the aggregates are 
deformed to almost kidney shape for 1.0 m s$^{-1}$ (d). In contrast, the homogeneous aggregates with 
$\phi_i = 0.35$ exclusively rebound. The region of increased filling factor in the interior increases 
with collision velocity. The impact site is flattened. The aggregates with low porosity also 
exclusively rebound at the given velocities. However, no significant compaction or deformation is 
visible except some fragmentation for the highest collision velocity (l) 



impacting projectile has a radius $r_p = 10$ cm just like the target. The impact velocity 
$v_0$ is 0.1, 0.3, 0.5, and 1.0 m s$^{-1}$. 

The theoretical demarcation between sticking and bouncing in Ref. [16] makes 
the following assumptions: (1) elastic deformation of the aggregates, and (2) the 
filling factor in the contact region is not changed in the collision. These assumptions 
are too simplistic. 

1. The assumption of elastic deformation of the aggregates is only valid for filling 
factors close to the maximum filling factor. In particular for highly porous 
aggregates the deformation is highly plastic and as a consequence the contact 
area between the aggregates is increased compared to elastic contact (see e.g. 
Fig. 4d). 
Fig. 5 Cross-section through collision results involving aggregates with a hard shell of 0.1 r. 
The interior of all aggregates is highly porous ($\phi_c = 0.15$), which would lead to perfect sticking if the 
aggregates were homogeneous. However, in the left (a-d), middle (e-h), and right column (i-l) the 
filling factors of the hard shell are 0.35, 0.45, and 0.55, respectively. Initially, both aggregates are 
spheres of 10 cm radius. The displayed cross-sections show the situation after the impact with 0.1 
(a, e, i), 0.3 (b, f, j), 0.5 (c, g, k), and 1.0 m s$^{-1}$ (d, h, l). Even thin hard shells can lead to bouncing 
if the filling factor is sufficiently high 


2. As can clearly be seen in Fig. 4, the filling factor is highly increased in the 
contact area. This leads to an increased tensile strength in this region (Eq. (5)). 
An increased tensile strength, however, also increases the contact energy, which 
promotes sticking. 

We conclude that, because of these effects, in particular for highly porous dust 
aggregates, the threshold velocity for sticking is much larger than presented in [16]. 
Consequently, the parameter space where bouncing occurs is much smaller than 
assumed in this reference and sticking dominates for low velocities and low filling 
factors. 
We use the same setup as above regarding ratio of target and projectile radius and 
collision velocity, but we add a hard outer shell (Fig. 5). The thickness of the hard 
shell is given as a fixed fraction of target and projectile radii, respectively, i.e. 0.1, 
0.2, 0.3, 0.4. The core of all aggregates has a filling factor $\phi_c = 0.15$, whereas the 
filling factor of the hard shell $\phi_h$ takes the values 0.35, 0.45, and 0.55. 
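
The core and shell setup can be written down as a simple radial profile. The sketch below assumes a sharp step between core and shell, which the text does not specify; the function names are hypothetical and only serve to make the geometry concrete.

```python
def filling_factor_at(r, radius, shell_fraction, phi_core=0.15, phi_shell=0.35):
    """Filling factor at distance r from the centre of a shelled sphere.

    ASSUMPTION: a sharp step between core and shell; how the transition is
    realised in the actual simulations is not described in the text.
    """
    if not 0.0 <= r <= radius:
        raise ValueError("r must lie inside the aggregate")
    inner_radius = (1.0 - shell_fraction) * radius
    return phi_shell if r > inner_radius else phi_core


def shell_volume_fraction(shell_fraction):
    """Fraction of the sphere volume occupied by the shell."""
    return 1.0 - (1.0 - shell_fraction) ** 3


# Shell thicknesses used in the study, as fractions of the aggregate radius
for frac in (0.1, 0.2, 0.3, 0.4):
    print(f"shell thickness {frac:.1f} r: "
          f"{100 * shell_volume_fraction(frac):.1f} % of the volume")
```

Even the thinnest shell of 0.1 r already contains about 27 % of the aggregate volume, since the outer layers of a sphere dominate its volume.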

From the comparison with aggregates without hard shell, which exclusively 
resulted in sticking (cf. Fig. 4, first row), it is evident that hard shells do have 
an influence on the bouncing behaviour of dust aggregates. This is because for 
aggregates with $\phi = 0.15$ the compressive strength is very low. As a consequence, 
nearly the total kinetic energy of the impact is dissipated by plastic deformation and 
nearly no elastic loading of the colliding objects is possible. 

Conversely, for aggregates with hard shells ($\phi_h = 0.35$) the plastic deformation 
threshold is higher for the shell. During the impact, the shell is elastically loaded and 
the aggregates rebound. However, in the immediate area around the impact site the 
deformation threshold for the shell is exceeded and plastic deformation takes place 
in the hard shell. Therefore, the tensile $T(\phi)$ and shear $Y(\phi)$ strengths are increased 
in this region and counteract the bouncing. However, hard shells of intermediate 
porosity are still sufficiently plastically deformable such that bouncing is still a rare 
event in the investigated parameter space. 

The situation changes for hard shell filling factors of $\phi_h = 0.45$ and 0.55. The 
value of $\phi_h = 0.55$ is close to maximum compaction. Thus, the hard shell breaks 
rather than being plastically deformed. As a consequence of the lacking plastic 
deformability, the collisions of aggregates with denser hard shells exclusively result 
in bouncing and partial fragmentation (of the shell). 

We conclude that hard shells of intermediate porosity in general do not prevent 
sticking. Only in a few cases, where the hard shell was thick enough for sufficient 
elastic loading or the velocity was right for a thin hard shell, does bouncing occur. 
However, if the hard shells become more compact, even thin shells suffice to yield 
bouncing collisions. For very compact hard shells and higher collision velocities 
partial fragmentation of mostly shell material can be expected. 


3.3 Stability of Rotating Dust Aggregates 

Particularly in aggregate collisions where the impact parameter is not vanishing, the 
aggregates or fragments start to rotate [24]. However, high spinning rates may lead 
to additional disruption of the aggregates since tidal forces may tear them apart. 
To assess this effect for pre-planetesimal collisions, we analytically derive critical 
angular velocities for a long rotating dust cylinder depending on its filling factor 
and radius. Additionally we carry out simulations of rotating cylinders with the 
same specifications. As a consistency check of the SPH code we compare the stress 
evolution inside the cylinder with the theoretical reference. We also compare the 
simulated critical angular velocity with the expected value. 
Table 1 Results of the rotating cylinder study. The cylinders have an initial filling factor $\phi_i$. The 
rotation is accelerated by $\dot\omega$. The onset of fragmentation starts at a critical time $t_c$ and a critical 
angular velocity $\omega_c$. Simulated and theoretical values of the latter are compared; the last column 
gives the simulated value as a percentage of the theoretical one 

$\phi_i$   $\dot\omega$ [s$^{-2}$]   $t_c$ [s]   $\omega_c$ [s$^{-1}$] simulation   $\omega_c$ [s$^{-1}$] theory   % 
0.15       1                         9.95        9.64                               9.52                           101.26 
0.25       1                         10.2        9.86                               14.3                           68.95 
0.35       0.2                       31.25       6.08                               17.98                          33.82 
0.45       0.2                       28.4        5.50                               23.56                          23.34 
0.55       0.2                       27.05       5.20                               34.36                          15.13 
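
The last column of Table 1 appears to be the simulated critical angular velocity expressed as a percentage of the theoretical one. The short check below recomputes it from the two $\omega_c$ columns; the interpretation of that column is an inference from the numbers, and the helper name is hypothetical.

```python
# Rows of Table 1: (phi_i, omega_c simulation, omega_c theory), both in s^-1
TABLE_1 = [
    (0.15, 9.64, 9.52),
    (0.25, 9.86, 14.3),
    (0.35, 6.08, 17.98),
    (0.45, 5.50, 23.56),
    (0.55, 5.20, 34.36),
]


def percent_of_theory(omega_sim, omega_theory):
    """Simulated critical angular velocity in percent of the theoretical one."""
    return 100.0 * omega_sim / omega_theory


ratios = [round(percent_of_theory(sim, theory), 2) for _, sim, theory in TABLE_1]

for (phi, _, _), ratio in zip(TABLE_1, ratios):
    print(f"phi_i = {phi:.2f}: {ratio:.2f} % of the theoretical value")
```

The recomputed percentages agree with the tabulated column to two decimal places.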


According to our porosity model presented in Sect. 2, disruption can occur when 
the stress inside the cylinder, represented by the full stress tensor $\sigma_{\alpha\beta}$, exceeds 
either the tensile strength $T(\phi)$ or the shear strength $Y(\phi)$. For each angular velocity 
the stress state is well defined [8]. From this one can derive two critical angular 
velocities $\omega_c^T$ and $\omega_c^Y$. The first arises from the tensile strength criterion and reads: 

$$\omega_c^T = \sqrt{\frac{-12(1-\nu)}{\rho_0 r_0^2 (3-\nu)}\, T(\phi)}, \tag{8}$$

where $r_0$ and $\nu$ are the radius of the cylinder and the Poisson ratio, respectively. 
Fragmentation due to exceeding the shear strength occurs when the following 
critical angular velocity is reached 

$$\omega_c^Y = \sqrt{\frac{8(1-\nu)}{\rho_0 r_0^2 (3-4\nu)}\, Y(\phi)}. \tag{9}$$

The cylinder fragments at the smaller of these two velocities. The theoretical 
values for the fragmentation criterion can be found in Table 1 for the investigated 
initial filling factors $\phi_i$. 
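
The two criteria can be evaluated numerically with a short sketch. Two inputs are assumptions made here and not taken from this section: the Poisson ratio is set to $\nu = 2/7$ (the value that follows if the shear modulus is half the bulk modulus), and the initial density is taken as $\rho_0 = \phi_i \rho_s$. The strength relations are those of Eqs. (5) to (7), the cylinder radius is 10 cm, and the function names are hypothetical.

```python
import math

RHO_S = 2000.0                 # matrix density in kg m^-3 (Sect. 2.1)
PHI_MIN, PHI_MAX = 0.12, 0.58  # limits of the compressive strength relation
DELTA, P_M = 0.58, 260.0       # exponent parameter and mean pressure in Pa
NU = 2.0 / 7.0                 # ASSUMPTION: Poisson ratio, not given in this section
R0 = 0.10                      # cylinder radius in m


def tensile_strength(phi):
    """Eq. (5), in Pa (negative by convention)."""
    return -(10.0 ** (2.8 + 1.48 * phi))


def shear_strength(phi):
    """Eq. (7), using the compressive strength of Eq. (6), in Pa."""
    base = (PHI_MAX - PHI_MIN) / (PHI_MAX - phi) - 1.0
    sigma = P_M * base ** (DELTA * math.log(10.0))
    return math.sqrt(sigma * abs(tensile_strength(phi)))


def critical_angular_velocities(phi, r0=R0, nu=NU):
    """Return (omega_c^T, omega_c^Y) in s^-1 following Eqs. (8) and (9)."""
    rho0 = phi * RHO_S  # ASSUMPTION: initial density from the filling factor
    w_t = math.sqrt(-12.0 * (1.0 - nu) * tensile_strength(phi)
                    / (rho0 * r0**2 * (3.0 - nu)))
    w_y = math.sqrt(8.0 * (1.0 - nu) * shear_strength(phi)
                    / (rho0 * r0**2 * (3.0 - 4.0 * nu)))
    return w_t, w_y


def critical_angular_velocity(phi):
    """The cylinder fragments at the smaller of the two critical velocities."""
    return min(critical_angular_velocities(phi))


for phi in (0.15, 0.25, 0.35, 0.45, 0.55):
    w_t, w_y = critical_angular_velocities(phi)
    print(f"phi_i = {phi:.2f}: tensile {w_t:6.2f}, shear {w_y:6.2f}, "
          f"critical {min(w_t, w_y):6.2f} s^-1")
```

With these assumptions the sketch gives values close to the theory column of Table 1; the shear criterion is the limiting one for the lower filling factors, and the tensile criterion takes over at $\phi_i = 0.55$.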

To check our theoretical predictions, we carry out simulations of a rotating 
cylinder consisting of porous SiO$_2$ dust. The radius of the cylinder is $r_0 = 10$ cm 
and its height is $h = 60$ cm. It rotates about the z-axis, which represents the central 
axis of the cylinder, and is accelerated by a constant angular acceleration $\dot\omega$. The 
exact values are given in Table 1. 

Initially the cylinder deforms elastically and the tensions inside the body 
increase. We measure the principal stresses of $\sigma_{\alpha\beta}$ along a line from (0,0,0), which 
represents the centre of mass of the cylinder, radially outwards along the x-axis. The 
resulting radial, azimuthal, and axial principal stresses in cylinder coordinates are displayed 
in Fig. 6 for t = 10 s and t = 25 s, shortly before the cylinder fragments. During 
its whole elastic evolution the stresses match the theoretical predictions extremely 
well. This confirms the validity of our code in the elastic regime. 
Fig. 6 Stresses $\sigma_{\alpha\beta}$ of the stable rotating cylinder with $\phi_i = 0.55$. The three principal stresses in 
cylinder coordinates (top and middle panels, followed by $\sigma_{zz}$) are measured along a line from the centre 
of the cylinder radially outwards at two different times t = 10 s (left) and t = 25 s (right). In this 
elastic regime the simulated stresses (black) almost perfectly follow the curves expected from the 
theory (colored). The radius of the cylinder is 10 cm; however, due to SPH smoothing this boundary 
is exceeded in the graphs. At the rim the stresses start to deviate from the expected values 


With respect to fragmentation, the simulation for the cylinder with $\phi_i = 0.15$ 
reproduces the critical angular velocity very well (see Table 1). For higher initial 
filling factors, however, the predicted critical angular velocity is much larger than the 
simulated value. The discrepancy increases with $\phi_i$. The reason for this is illustrated 
in Fig. 7, which displays the evolution of the density inside the cylinder. As the 
body rotates, density waves develop inside. These lead to a local rarefaction of the 
material. As a consequence the tensile and shear strength are diminished in these 
[Figure 7 panels: t = 25.00 s, t = 26.60 s, t = 27.00 s, t = 27.40 s; horizontal axis x] 

Fig. 7 Density evolution of a rotating cylinder with $\phi_i = 0.55$ after the stable phase. The images 
show a cut along the x-z plane through the centre of the cylinder, which is rotating about the z-axis. 
The density is colour-coded. While at t = 25 s the density distribution is still homogeneous, at 
t = 26 s density fluctuations at the top and bottom of the cylinder start to develop. Together with 
the density the tensile and shear strength are locally diminished. These regions are weak spots at 
which rupture starts (t = 27 s and t = 27.4 s) 


regions (Fig. 8). However, the stresses increase with increasing angular velocity. 
The rarefied regions serve as weak spots from which rupture can develop due to 
material failure which arises from a stress exceeding the strength values. 


4 Outlook 

The presented results encourage further studies for a deeper insight into the outcome 
of pre-planetesimal collisions. First of all, the sudden drop in fragmentation 
threshold velocity at a given porosity requires further investigations. More simulations 
have to be carried out to assess other dependencies of this behaviour, for example 
the projectile size. A high resolution in time is desirable at this transition point 
to understand the complex interplay between elastic and plastic behaviour. With 
respect to the rotating cylinder the development of the elastic waves should be 
studied in more detail. 

Recently, new measurements for another possible pre-planetesimal material 
became available. This calls for an extensive parameter study using this new 
material to explore possible differences with respect to growth conditions. 
Off-centre collisions of pre-planetesimals are still to be studied, because they represent 
the more realistic situation in the disc. 
[Figure 8 panel label: t = 26.60 s] 

Fig. 8 Local shear strength $Y(\phi)$ (red) and von Mises stress $\sqrt{3 J_2}$ (black) for r > 0 at z = 0.2 
for a rotating cylinder with $\phi_i = 0.55$. According to the von Mises yielding criterion the cylinder 
starts to fragment once the von Mises stress exceeds the shear strength. In the stable regime (t = 
20 s) $Y(\phi)$ is two orders of magnitude larger than the stresses. However, density fluctuations also 
lead to fluctuations in $Y(\phi)$, which are already visible at t = 26 s and t = 27 s. Eventually, the shear 
strength locally falls below the von Mises stress at t = 27.15 s and r ≈ 1 cm, and rupture sets in 


To gain an idea of what realistic processed pre-planetesimals look like, it is 
desirable to study the collision history of an aggregate. For this, a homogeneous 
aggregate is successively collided with other aggregates of different sizes and filling 
factors at various collision velocities. The input information for a realistic collision 
sequence can be derived from global coagulation simulations. 
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Abstract The theory for the formation of the first population of stars (Pop. Ill) 
predicts an initial mass function (IMF) dominated by high-mass stars, in contrast 
to the present-day IMF, which tends to yield mostly stars with masses less than 
1 Mq ■ The leading theory for the transition in the characteristic stellar mass predicts 
that the cause is the extra cooling provided by increasing metallicity. In particular, 
dust can overtake H 2 as the leading coolant at very high densities. The aim of 
this work is to determine the influence of dust cooling on the fragmentation of 
very low metallicity gas. To investigate this, we make use of high-resolution 
hydrodynamic simulations with sink particles to replace contracting protostars, and 
analyze the collapse and further fragmentation of star-forming clouds. We follow 
the thermodynamic response of the gas by solving the full thermal energy equation, 
and also track the behavior of the dust temperature and the chemical evolution of 
the gas. We model four clouds with different metallicities ($10^{-4}$, $10^{-5}$, $10^{-6}\,Z_\odot$,
and 0), and determine the properties of each cloud at the point at which it undergoes 
gravitational fragmentation. We find evidence for fragmentation in all four cases, 
and hence conclude that there is no critical metallicity below which fragmentation 
is impossible. Nevertheless, there is a clear change in the behavior of the clouds at 
$Z = 10^{-5}\,Z_\odot$, caused by the fact that at this metallicity, fragmentation takes longer
to occur than accretion, leading to a flat mass function at lower metallicities. 
1 How Are the Stars Formed in the Early Universe?

The first stars to form during the earliest stages of the Universe, the so-called
Population III (Pop. III), were thought to be massive, with numerical
simulations predicting masses in the range 20-150 $M_\odot$ [e.g. 1, 6, 28, 41]. However,
recent results show that lower mass stars can also be formed, albeit with characteristic
masses above the solar value [11, 12, 17, 18, 36, 38]. This contrasts with
present-day star formation, which typically yields stars with masses less than 1 $M_\odot$
[9, 24], and so at some point in the evolution of the Universe there must have been
a transition from primordial (Pop. III) star formation to the mode of star formation
we see today (Pop. II/I).

Metal line cooling and dust cooling are effective at lower temperatures and larger 
densities, and so it has been proposed that metal enrichment of the interstellar 
medium by previous generations of stars causes the transition from Pop. III to
Pop. II. This suggests that there might be a critical metallicity, $Z_{\rm crit}$, at which the
mode of star formation changes. 

The main coolants that have been studied in the literature are C II and O I fine
structure emission [7, 8, 15, 21, 23, 29, 34, 35], and dust emission [e.g. 26, 27, 30, 32,
33]. Carbon and oxygen are identified as the key species because in the temperature
and density conditions that characterize the early phases of Pop. III star formation,
the O I and C II fine-structure lines dominate over all other metal line transitions
[20]. By equating the C II or O I fine structure cooling rate to the compressional
heating rate due to free-fall collapse, one can define critical abundances [C/H] =
-3.5 and [O/H] = -3.0¹ for efficient metal line cooling [8].

A more promising way to form low mass Pop. II stars involves dust cooling. 
Dust cooling models [e.g. 13, 26, 27, 32, 33] predict a much lower critical metallicity
($Z_{\rm crit} \approx 10^{-4}$ to $10^{-6}\,Z_\odot$), with most of the uncertainty coming from the nature of
the dust in high-redshift galaxies. 

At densities $n > 10^{11}\,{\rm cm^{-3}}$ dust cooling becomes efficient [26], since inelastic
gas-grain collisions are more frequent [19]. This cooling enhances fragmentation, 
and since it occurs at high densities, the distances between fragments can be 
very small [25, 27, 30, 31, 33]. In this regime, interactions between fragments 
will be common, and analytic models of fragmentation are unable to predict the 
mass distribution of the fragments. A full 3D numerical treatment, following the 
fragments, is needed. 

In [13], we improved upon these previous treatments by solving the full 
thermal energy equation, and calculating the dust temperature through the energy 
equilibrium equation. We assumed that the only significant external heat source is 
the cosmic microwave background (CMB), and included its effects in the calculation 
of the dust temperature. We found that model clouds with metallicities as low as
$10^{-4}\,Z_\odot$ or $10^{-5}\,Z_\odot$ do indeed show evidence for dust cooling and fragmentation,
supporting the predictions of [39, 40] and [10].

¹ $[{\rm X/Y}] = \log_{10}(N_{\rm X}/N_{\rm Y})_{*} - \log_{10}(N_{\rm X}/N_{\rm Y})_{\odot}$, for elements X and Y, where $*$ denotes the gas in
question, and where $N_{\rm X}$ and $N_{\rm Y}$ are the mass fractions of the elements X and Y.

In this work, we simulate the evolution of star-forming clouds for a wider range 
of metallicities ($10^{-4}$, $10^{-5}$, $10^{-6}\,Z_\odot$, and 0), and study the effect that this has on
the mass function of the fragments that form. We also investigate how properties 
such as cooling and heating rates, and number of Bonnor-Ebert masses [5,14] of 
the fragmenting clouds vary with metallicity and whether there is any systematic 
change in behavior with increasing metallicity. 


2 Simulations 

2.1 Numerical Method 

We model the collapse of a low-metallicity gas cloud using a modified version 
of the Gadget-2 [37] smoothed particle hydrodynamics (SPH) code. To enable us 
to continue our simulation beyond the formation of the first very high density 
protostellar core, we use a sink particle approach [3, 22], in the same way as in 
[13]. Sink particles are created once the SPH particles are bound, collapsing, and 
within an accretion radius, $r_{\rm acc}$, which we take to be 1.0 AU. The threshold number
density for sink particle creation is $5.0 \times 10^{13}\,{\rm cm^{-3}}$. At the threshold density, the
Jeans length at the minimum temperature reached by the gas is approximately one 
AU, while at higher densities the gas becomes optically thick and begins to heat 
up. Further fragmentation on scales smaller than the sink particle scale is therefore 
unlikely to occur. For further discussion of the details of our sink particle treatment, 
we refer the reader to [11]. 
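
The statement that the Jeans length at the sink threshold density is of order one AU can be checked with a short estimate. The sketch below is illustrative only: the mean molecular weight (2.33) and the gas temperature (200 K, the minimum quoted later for the highest-metallicity run) are assumptions, not values stated at this point in the text, and different conventions for the Jeans length differ by factors of order unity.

```python
import math

# Physical constants in CGS units.
G = 6.674e-8       # gravitational constant [cm^3 g^-1 s^-2]
K_B = 1.381e-16    # Boltzmann constant [erg K^-1]
M_H = 1.673e-24    # hydrogen atom mass [g]
AU = 1.496e13      # astronomical unit [cm]


def jeans_length_au(n_cm3, temperature_k, mu=2.33):
    """Jeans length lambda_J = sqrt(pi * c_s^2 / (G * rho)), returned in AU.

    mu is an ASSUMED mean molecular weight (molecular gas); it is not
    given in the text.
    """
    rho = mu * M_H * n_cm3                      # mass density [g cm^-3]
    cs2 = K_B * temperature_k / (mu * M_H)      # isothermal sound speed squared
    return math.sqrt(math.pi * cs2 / (G * rho)) / AU


# Threshold density from the text; the temperature is an assumed value.
lam = jeans_length_au(5.0e13, 200.0)
print(f"Jeans length at the sink threshold density: {lam:.1f} AU")
```

Under these assumptions the result is a few AU, i.e. the same order as the 1.0 AU accretion radius.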

For the metallicities and dust-to-gas ratios considered in this study, the dominant 
sources of cooling are the standard primordial coolants (H$_2$ bound-bound emission
and collision-induced emission) and energy transfer from the gas to the dust.
Collisions between gas particles and dust grains can transfer energy from the gas
to the dust (if the gas temperature $T$ is greater than the dust temperature $T_{\rm gr}$), or
from the dust to the gas (if $T_{\rm gr} > T$). Full details of the dust cooling treatment can
be found in [13]. 


3 Code Performance 

Each simulation needs approximately 130 kCPU-h to complete. In order
to run them in a reasonable time, we need to make use of 256 or 512 CPUs. A
minimum of 40 million SPH particles is required to resolve the fragment mass
and correctly predict the mass distribution of the protostellar cores.
Table 1 Parallel efficiency for different numbers of CPUs and SPH particles in
the simulations run. A dash marks a combination with no entry

CPUs    Particles
        10^5    10^6    10^7    4 x 10^7
8       1.00    -       -       -
16      0.82    -       -       -
32      0.52    1.00    -       -
64      0.26    0.91    1.00    1.00
128     -       0.44    0.92    0.95
256     -       -       0.50    0.75
512     -       -       -       0.52


We also ran simulations to evaluate the parallel efficiency using different numbers
of CPUs and SPH particles. The time is normalized to the minimum number of
CPUs, which is related to the number of SPH particles in the simulation. The
simulation was run until a predicted result was achieved, without file writing, so
that we could measure just the time needed for the CPUs to calculate the evolution
of the gas and dust. The result is shown in Table 1.
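
To make the efficiencies easier to interpret, they can be converted into speed-ups relative to the reference CPU count. The sketch below assumes the usual strong-scaling definition, E = (T_ref N_ref) / (T N), which the text does not state explicitly, and uses the Table 1 entries for the 4 x 10^7 particle runs.

```python
# Parallel efficiencies from Table 1 for the 4 x 10^7 particle runs,
# keyed by CPU count; 64 CPUs is the reference (efficiency 1.00).
efficiency = {64: 1.00, 128: 0.95, 256: 0.75, 512: 0.52}


def speedup(n_cpus, eff, n_ref):
    """Speed-up relative to n_ref CPUs, ASSUMING E = T_ref*n_ref / (T*n)."""
    return eff * n_cpus / n_ref


n_ref = min(efficiency)
speedups = {n: speedup(n, e, n_ref) for n, e in efficiency.items()}

for n, s in speedups.items():
    print(f"{n:4d} CPUs: efficiency {efficiency[n]:.2f}, speed-up {s:.2f}x")
```

With this definition, going from 64 to 512 CPUs (eight times the resources) shortens the run by roughly a factor of four.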

In addition, to cover the parameter space, we ran four simulations, resulting in
0.55 million CPU-h. For resolution studies and post-processing, we had an additional
overall cost of approximately 0.15 million CPU-h. In total, this project required
approximately 0.7 million CPU-h.
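
The quoted budget can be cross-checked against the per-run cost given at the start of this section. The following is a rough sketch: it takes the "approximately 130 kCPU-h" figure at face value for each run and assumes that figure holds at either CPU count.

```python
cpu_hours_per_run = 130e3   # "approximately 130 kCPU-h" per simulation
n_runs = 4                  # one run per metallicity

production = n_runs * cpu_hours_per_run

# Wall-clock days per run, assuming the per-run cost is the same
# whether 256 or 512 CPUs are used (an assumption, see Table 1).
wall_days = {n: cpu_hours_per_run / n / 24.0 for n in (256, 512)}

print(f"production runs: {production / 1e6:.2f} million CPU-h")
for n, d in wall_days.items():
    print(f"{n} CPUs: about {d:.0f} days of wall-clock time per run")
```

The product of four runs and 130 kCPU-h is close to, but slightly below, the quoted production cost, which is consistent with the per-run figure being approximate.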


4 Computational Challenge 

We performed a set of four simulations, with metallicities $Z/Z_\odot = 10^{-4}$, $10^{-5}$,
$10^{-6}$, and the metal-free case. Each simulation used 40 million SPH particles. We
used these simulations to model the collapse of an initially uniform gas cloud with
an initial number density of $10^{5}\,{\rm cm^{-3}}$ and an initial temperature of 300 K. The
cloud mass was 1,000 $M_\odot$. We included small amounts of turbulent and rotational
energy, with $\alpha = E_{\rm turb}/|E_{\rm grav}| = 0.1$ and $\beta = E_{\rm rot}/|E_{\rm grav}| = 0.02$, where $E_{\rm grav}$ is the
gravitational potential energy, $E_{\rm turb}$ is the turbulent kinetic energy, and $E_{\rm rot}$ is the
rotational energy. The mass resolution is $2.5 \times 10^{-3}\,M_\odot$, which corresponds to 100
times the SPH particle mass [see e.g. 4]. The redshift chosen was $z = 15$, when the
cosmic microwave background temperature was 43.6 K. The dust properties were
taken from [16], and the dust grain opacities were calculated in the same fashion
as in [2]. In the calculations, the opacities vary linearly with $Z$, which means for
instance that for the $Z/Z_\odot = 10^{-4}$ calculations, the opacities were $10^{-4}$ times the
original values.
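
The numbers in this setup are mutually consistent, as a quick calculation shows. In the sketch below the present-day CMB temperature of 2.725 K is an assumed input (it is not quoted in the text); everything else is taken from the paragraph above.

```python
cloud_mass = 1000.0     # cloud mass [M_sun]
n_particles = 40e6      # SPH particles per simulation

particle_mass = cloud_mass / n_particles      # [M_sun]
mass_resolution = 100 * particle_mass         # 100 particle masses [M_sun]

T_CMB_TODAY = 2.725     # present-day CMB temperature [K] (assumed value)
z = 15
t_cmb = T_CMB_TODAY * (1 + z)                 # CMB temperature at redshift z

print(f"SPH particle mass: {particle_mass:.1e} M_sun")
print(f"mass resolution:   {mass_resolution:.1e} M_sun")
print(f"T_CMB(z = {z}):     {t_cmb:.1f} K")
```

The particle mass of 2.5 x 10^-5 solar masses reproduces the quoted resolution, and the standard scaling T_CMB(z) = T_0 (1 + z) reproduces the quoted 43.6 K.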
5 Scientific Results

5.1 Thermodynamical Evolution of Gas and Dust 

Dust cooling is a consequence of inelastic gas-grain collisions, and thus the energy 
transfer from gas to dust vanishes when they have the same temperature. We 
therefore expect the cooling to cease when the dust reaches the gas temperature. In 
order to evaluate the effect of dust on the thermodynamic evolution of the gas and 
verify this assumption, we plot in Fig. 1 the temperature and density for the various
metallicities tested. We compare the evolution of the dust and gas temperatures 
in the simulations, at the point of time just before the formation of the first sink 
particle (see Table 2). The dust temperature (shown in blue) varies from the CMB 
temperature in the low density region to the gas temperature (shown in red) at much 
higher densities. 

The efficiency of the cooling is also expressed in the temperature drop at 
high densities. The gas temperature decreases to roughly 400 K in the $10^{-5}\,Z_\odot$
simulation, and 200 K in the $Z = 10^{-4}\,Z_\odot$ case. This temperature drop significantly
increases the number of Jeans masses present in the collapsing region, making the 
gas unstable to fragmentation. The dust and the gas temperatures couple for high 
densities, when the compressional heating starts to dominate again over the dust 
cooling. The subsequent evolution of the gas is close to adiabatic. 
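
The link between the temperature drop and the number of Jeans masses can be illustrated with the standard scaling of the Jeans mass with temperature and density, M_J ∝ T^{3/2} n^{-1/2}. The sketch below is not taken from the simulations: it compares the two minimum temperatures quoted above at equal density, which is a simplifying assumption, since in the runs the minima are reached at different densities.

```python
def jeans_mass_ratio(t1, t2, n1=1.0, n2=1.0):
    """Ratio M_J(t1, n1) / M_J(t2, n2), using M_J proportional to T^1.5 * n^-0.5."""
    return (t1 / t2) ** 1.5 * (n2 / n1) ** 0.5


# Minimum gas temperatures quoted in the text for the two dusty runs.
t_z5 = 400.0   # Z = 1e-5 Z_sun [K]
t_z4 = 200.0   # Z = 1e-4 Z_sun [K]

# At equal density (ASSUMPTION), the colder gas has a smaller Jeans mass,
# so a given amount of gas contains more Jeans masses by this factor.
ratio = jeans_mass_ratio(t_z5, t_z4)
print(f"M_J(400 K) / M_J(200 K) at equal density: {ratio:.2f}")
```

A factor of two in temperature therefore changes the Jeans mass by a factor of 2^{3/2}, which is why a modest additional temperature drop makes the gas noticeably more prone to fragmentation.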


5.2 Fragmentation 

The transport of angular momentum to smaller scales during the collapse leads to 
the formation of a dense disk-like structure, supported by rotation. This disk then 
fragments into multiple objects. 

Figure 2 shows the density structure of the gas immediately before the formation 
of the first protostar. The top-left panel shows a density slice on a scale comparable 
to the size of the initial gas distribution. The structure is very filamentary and 
there are two main over-dense clumps in the center. If we zoom in on one of
the clumps, we see that its internal structure is also filamentary. Observe that at
large scales the gas cloud properties are the same for all metallicities. Differences
in the thermodynamic evolution appear only at $n > 10^{11}\,{\rm cm^{-3}}$ (see Fig. 1). As a
consequence, we observe variations in the cloud structure only in the high-density 
regions. 

For $Z = 10^{-6}\,Z_\odot$ and 0, the formation of spiral structures is not observed. In
these two runs, star formation occurs mainly in the central clump. 

At the beginning of the simulation, the cloud had ~3 $M_{\rm BE}$. During the collapse,
the gas cools and reaches ~6 $M_{\rm BE}$ in all cases. Cooling and heating are different
depending on the metallicity, and this difference is seen for distances smaller than
~400 AU. The $Z = 10^{-4}\,Z_\odot$ case, for instance, has twice the number of $M_{\rm BE}$ for
distances smaller than ~10 AU, when compared to the other cases. This will have
direct consequences for the fragment mass function as we will see in the following
section (Fig. 3).

Fig. 1 Dependence of gas and dust temperatures on gas density for metallicities $10^{-4}$,
$10^{-5}$, $10^{-6}$, and zero times the solar value, calculated just before the first
sink particle was formed (see Table 2). In red, we show the gas temperature, and in blue
the dust temperature. The dashed lines are lines of constant Jeans mass.
[Horizontal axis: number density (cm$^{-3}$), from $10^{4}$ to $10^{14}$]

Table 2 Sink particle properties for the different metallicities at the point where 4.7 $M_\odot$ have
been accreted by the sink particles. “ST” (start time) is the time when sink particles start to form.
“FT” (formation time) is the time taken to accrete 4.7 $M_\odot$ in the sinks. “SFR” is the mean star
formation rate. Mean and median refer to the final mean and median sink mass. Finally, “N” is the
number of sink particles formed

$Z/Z_\odot$   ST            FT       SFR                Mean          Median        N
              (10^3 year)   (year)   ($M_\odot$/year)   ($M_\odot$)   ($M_\odot$)
0             171.6         73       0.064              0.24          0.12          19
10^-6         171.2         72       0.065              0.29          0.06          16
10^-5         170.8         88       0.053              0.24          0.11          19
10^-4         169.2         138      0.034              0.10          0.05          45
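
The columns of Table 2 are related through the fixed accreted mass of 4.7 solar masses: the mean star formation rate should equal this mass divided by FT, and the mean sink mass should equal it divided by N. The sketch below recomputes both from the tabulated values; it is only a consistency check on the published numbers and also evaluates the star formation efficiency for the 1,000 solar mass cloud.

```python
ACCRETED_MASS = 4.7    # mass in sinks when the runs were stopped [M_sun]
CLOUD_MASS = 1000.0    # initial cloud mass [M_sun]

# Rows of Table 2: (Z/Z_sun, FT [year], SFR [M_sun/year], mean [M_sun], N)
TABLE2 = [
    (0.0, 73, 0.064, 0.24, 19),
    (1e-6, 72, 0.065, 0.29, 16),
    (1e-5, 88, 0.053, 0.24, 19),
    (1e-4, 138, 0.034, 0.10, 45),
]


def check_row(ft, sfr, mean, n):
    """Recompute SFR and mean sink mass; also return the absolute residuals."""
    sfr_calc = ACCRETED_MASS / ft
    mean_calc = ACCRETED_MASS / n
    return sfr_calc, mean_calc, abs(sfr_calc - sfr), abs(mean_calc - mean)


results = [check_row(ft, sfr, mean, n) for _, ft, sfr, mean, n in TABLE2]
sfe_percent = 100.0 * ACCRETED_MASS / CLOUD_MASS   # star formation efficiency

for (z, *_), (sfr_c, mean_c, _, _) in zip(TABLE2, results):
    print(f"Z/Z_sun = {z:g}: SFR = {sfr_c:.3f} M_sun/yr, mean = {mean_c:.3f} M_sun")
print(f"star formation efficiency when the runs stop: {sfe_percent:.2f} %")
```

The recomputed rates match the tabulated ones to the quoted precision, and the mean masses agree to within about 0.01 solar masses.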





Fig. 2 Number density maps for a slice through the high density region for $Z = 10^{-4}\,Z_\odot$ (top),
$10^{-5}\,Z_\odot$, $10^{-6}\,Z_\odot$, and 0 (bottom). The image shows a sequence of zooms in on the density
structure in the gas immediately before the formation of the first protostar


5.3 Properties of the Fragments 

The simulations were stopped at a point when 4.7 $M_\odot$ of gas had been accreted
into the sink particles, because the high computational cost made it impractical
to continue. Figure 4 shows the mass distribution of sink particles at that time.
We typically find sink masses below 1 $M_\odot$, with somewhat smaller values in the
$10^{-4}\,Z_\odot$ case compared to the other cases. No sharp transition in fragmentation
behavior was found, but rather a smooth and complex interaction between kinematic
and thermodynamic properties of the cloud.

G. Dopcke et al.

Fig. 3 Number density map showing a slice through the densest clump, and the star formation
efficiency (SFE) for $Z = 10^{-4}\,Z_\odot$ (bottom), $10^{-5}\,Z_\odot$, $10^{-6}\,Z_\odot$, and 0 (top). The box is
100 AU × 100 AU and the percentage indicates the star formation efficiency, i.e. the total mass
in the sinks divided by the cloud mass (1,000 $M_\odot$)

Table 2 lists the main sink particle properties. It shows that the time taken to form
the first sink particle is slightly shorter for higher metallicities. This shorter time is
a consequence of the more efficient cooling by dust, which decreases the thermal
energy that was delaying the gravitational collapse. In Table 2 we also observe that
the star formation rate is lower for $Z = 10^{-4}\,Z_\odot$. This is because star formation
started at an earlier stage of the collapse, when the mean density of the cloud was
lower and there was less dense gas available to form stars.

Fig. 4 Sink particle mass function at the point when 4.7 $M_\odot$ of gas had been
accreted by the sink particles. The mass resolution of the simulations is indicated by
the vertical line

To better understand whether the resulting stellar cluster was affected by varying 
the metallicity, we plot the final sink mass distribution in Fig. 4. It shows that for 
the simulations with $Z \le 10^{-5}\,Z_\odot$, the resulting sink particle mass function is
relatively flat. There are roughly equal numbers of low-mass and high-mass stars,
implying that most of the mass is to be found in the high-mass objects. This mass
function is consistent with those found in other recent studies of fragmentation in
metal-free gas [18, 36]. If the sink particle mass function provides a reliable guide
to the form of the final stellar IMF, it suggests that at these metallicities, the IMF 
will be dominated by high-mass stars. 

All of the histograms in Fig. 4 have the lowest sink particle mass well above the 
resolution limit of 0.0025 $M_\odot$. Note that in all cases, we are still looking at the very
early stages of star cluster evolution. As a consequence, the sink particle masses in
Fig. 4 are not the same as the final protostellar masses: there are many mechanisms
that will affect the mass function, such as continuing accretion, mergers between the
newly formed protostars, and feedback from winds, jets, and accretion luminosity.


6 Conclusions 

In this paper we have addressed the question of whether dust cooling can lead 
to the fragmentation of low-metallicity star-forming clouds. For this purpose we 
performed numerical simulations that follow the thermodynamical and chemical 
evolution of collapsing clouds. The chemical model included a primordial chemical 
network together with a description of dust effects, where the dust temperature was 
calculated by solving self-consistently the thermal energy equilibrium equation. 

As a result, we found that dust can cool the gas, for number densities higher
than $10^{11}$, $10^{12}$, and $3 \times 10^{13}\,{\rm cm^{-3}}$ for $Z = 10^{-4}$, $10^{-5}$, and $10^{-6}\,Z_\odot$, respectively.
Higher metallicity implies larger dust-to-gas fraction, and consequently stronger 
cooling. This is reflected in a lower temperature of the dense gas for the higher 
metallicity simulations, and this colder gas permitted a faster collapse. Therefore, 
the fragmentation behavior of the gas depends on the metallicity, and higher 
metallicities lead to a faster collapse. 

For example, the characteristic fragment mass was lower for $Z = 10^{-4}\,Z_\odot$,
since a lower temperature reduces the Bonnor-Ebert masses at the point where the
gas undergoes fragmentation. This also implies a lower ratio of fragmentation to
accretion time, $t_{\rm frag}/t_{\rm acc}$, which will lead to a mass function dominated by low-mass
objects. For $Z \le 10^{-5}\,Z_\odot$, fragmentation and accretion timescales are comparable,
and the resulting mass spectrum is rather flat, with roughly equal numbers of stars 
in each mass bin. 

In addition, dust cooling appears to be insufficient to change the stellar
mass distribution for the $Z = 10^{-5}$ and $10^{-6}\,Z_\odot$ cases, when compared with the
metal-free case. This can be seen in the sink particle mass function (Fig. 4), which
shows that the $Z \le 10^{-5}\,Z_\odot$ cases do not appear to be fundamentally different.

Finally, we conclude that dust is not an efficient coolant at metallicities below
or equal to $Z_{\rm crit} = 10^{-5}\,Z_\odot$, in the sense that it cannot change the fragmentation
behavior for these metallicities. Our results support the idea that low mass fragments
can form in the absence of metals, and clouds with $Z \le Z_{\rm crit}$ will form a cluster with
a flat IMF.
The SuperN-Project: Porting and Optimizing
VERTEX-PROMETHEUS on the Cray XE6 
at HLRS for Three-Dimensional Simulations 
of Core-Collapse Supernova Explosions 
of Massive Stars 


F. Hanke, A. Marek, B. Müller, and H.-Th. Janka


Abstract Supernova explosions are among the most powerful cosmic events, 
whose physical mechanism and consequences are still incompletely understood. 
We have developed a fully MPI-OpenMP parallelized version of our VERTEX- 
PROMETHEUS code in order to perform three-dimensional simulations of stellar 
core-collapse and explosion on Tier-0 systems such as Hermit at HLRS. Tests on 
up to 64,000 cores have shown excellent scaling behavior. In this report we present 
the system of equations and the algorithm for its solution that are employed in our 
code VERTEX-PROMETHEUS. We also discuss the parallelization of VERTEX- 
PROMETHEUS and present our progress in porting, optimizing, and performing 
production runs on a large variety of machines, starting from vector machines and 
reaching to modern systems. In particular the results of our efforts to achieve good 
parallel scaling on the new Cray XE6 at HLRS Stuttgart are highlighted. 


1 Introduction 

A star more massive than about eight solar masses ends its life in a catastrophic 
explosion, a supernova. Its quiescent evolution comes to an end, when the pressure 
in its inner layers is no longer able to balance the inward pull of gravity. Throughout 
its life, the star sustained this balance by generating energy through a sequence 
of nuclear fusion reactions, forming increasingly heavier elements in its core. 
However, when the core consists mainly of iron-group nuclei, central energy 
generation ceases. The fusion reactions producing iron-group nuclei relocate to the 
core’s surface, and their “ashes” continuously increase the core’s mass. Similar 
to a white dwarf, such a core is stabilised against gravity by the pressure of its 
degenerate gas of electrons. However, to remain stable, its mass must stay smaller
than (roughly) the Chandrasekhar limit. When the core grows larger than this limit, it
collapses to a neutron star, and a huge amount ($\sim 10^{53}$ erg) of gravitational binding
energy is set free. Most (~99 %) of this energy is radiated away in neutrinos, but
a small fraction is transferred to the outer stellar layers and drives the violent mass 
ejection, which disrupts the star in a supernova. 

Despite 40 years of research, the details of how this energy transfer happens and 
how the explosion is initiated are still not well understood. Observational evidence 
about the physical processes deep inside the collapsing star is sparse and almost 
exclusively indirect. The only direct observational access is via measurements 
of neutrinos or gravitational waves. To obtain insight into the events in the 
core, one must therefore heavily rely on sophisticated numerical simulations. The 
enormous amount of computer power required for this purpose has led to the use of 
several, often questionable, approximations and numerous ambiguous results in the 
past. Fortunately, however, the development of numerical tools and computational 
resources has meanwhile advanced to a point, where it is becoming possible to 
perform multi-dimensional simulations with unprecedented accuracy. Therefore 
there is hope that the physical processes which are essential for the explosion can 
finally be unravelled. 

An understanding of the explosion mechanism is required to answer many 
important questions of nuclear, gravitational, and astro-physics like the following: 

• How do the explosion energy, the explosion timescale, and the mass of the 
compact remnant depend on the progenitor’s mass? Is the explosion mechanism 
the same for all progenitors? For which stars are black holes left behind as 
compact remnants instead of neutron stars? 

• What is the role of the - incompletely known - equation of state (EoS) of the 
proto-neutron star? Do softer or stiffer EoSs favour the explosion of a core 
collapse supernova? 

• How do neutron stars receive their natal kicks? Are they accelerated by asymmetric
mass ejection and/or anisotropic neutrino emission?

• What are the generic properties of the neutrino emission and of the gravitational 
wave signal that are produced during stellar core collapse and explosion? Up 
to which distances could these signals be measured with operating or planned 
detectors on earth and in space? And what can one learn about supernova 
dynamics or nuclear and particle physics from a future measurement of such 
signals in the case of a Galactic supernova? 

• How do supernovae contribute to the enrichment of the intergalactic medium 
with heavy elements? What kind of nucleosynthesis processes occur during and 
after the explosion? Can the elemental composition of supernova remnants be 
explained correctly by the numerical simulations? Does the rapid neutron capture 
process (r-process), which produces e.g. gold and the actinides, take place in 
supernovae? 
2 Numerical Modeling

2.1 History and Constraints

According to theory, a shock wave is launched at the moment of “core bounce” 
when the neutron star begins to emerge from the collapsing stellar iron core. There is 
general agreement, supported by all “modern” numerical simulations, that this shock 
is unable to propagate directly into the stellar mantle and envelope, because it loses 
too much energy in dissociating iron into free nucleons while it moves through the 
outer core. The “prompt” shock ultimately stalls. Thus the currently favoured 
theoretical paradigm exploits the fact that a huge energy reservoir is present in 
the form of neutrinos, which are abundantly emitted from the hot, nascent neutron 
star. The absorption of electron neutrinos and anti-neutrinos by free nucleons in the 
post-shock layer is thought to reenergize the shock, thus triggering the supernova 
explosion. 

Detailed spherically symmetric hydrodynamic models, which recently include 
a very accurate treatment of the time-dependent, multi-flavour, multi-frequency 
neutrino transport based on a numerical solution of the Boltzmann transport 
equation [1,2], reveal that this “delayed, neutrino-driven mechanism” does not 
work as simply as originally envisioned. Although in principle able to trigger the 
explosion (e.g., [3-5]), neutrino energy transfer to the post-shock matter turned out 
to be too weak. For inverting the infall of the stellar core and initiating powerful 
mass ejection, an increase of the efficiency of neutrino energy deposition is needed. 

A number of physical phenomena have been pointed out that can enhance 
neutrino energy deposition behind the stalled supernova shock. They are all linked 
to the fact that the real world is multi-dimensional instead of spherically symmetric 
(or one-dimensional; 1D) as assumed in the works cited above:

(1) Convective instabilities in the neutrino-heated layer between the neutron star 
and the supernova shock develop to violent convective overturn [6]. This 
convective overturn is helpful for the explosion, mainly because (a) neutrino- 
heated matter rises and increases the pressure behind the shock, thus pushing 
the shock further out, (b) cool matter is able to penetrate closer to the neutron 
star where it can absorb neutrino energy more efficiently, and (c) the rise of 
freshly heated matter reduces energy losses by the reemission of neutrinos. 
These effects allow multi-dimensional models to explode more easily than spherically
symmetric ones [7-9].

(2) Recent work [10-13] has demonstrated that the stalled supernova shock is 
also subject to a second non-radial low-mode instability, called the standing 
accretion shock instability or “SASI” for short, which can grow to a dipolar, 
global deformation of the shock [12,14,15]. 
(3) Convective energy transport inside the nascent neutron star [16-18] might
enhance the energy transport to the neutrinosphere and could thus boost the 
neutrino luminosities. This would in turn increase the neutrino-heating behind 
the shock. 

This list of multi-dimensional phenomena (limited to non-magnetized supernova 
cores) awaits more detailed exploration by multi-dimensional simulations. Until 
recently, such simulations have been performed with only a grossly simplified 
treatment of the involved microphysics, in particular of the neutrino transport and 
neutrino-matter interactions. At best, grey (i.e., single energy) flux-limited diffusion 
schemes were employed. Since, however, the role of the neutrinos is crucial for the 
problem, and because previous experience shows that the outcome of simulations 
is indeed very sensitive to the employed transport approximations, studies of the 
explosion mechanism require the best available description of the neutrino physics. 
This implies that one has to solve the Boltzmann transport equation for neutrinos. 


2.2 The Mathematical Model 

As core-collapse supernovae involve such a complex interplay of hydrodynamics, 
self-gravity and neutrino heating and cooling, numerical modellers face a classical 
“multiphysics” problem. Although the overall problem can still be formulated as 
a system of non-linear partial differential equations, rather dissimilar methods - 
sometimes with conflicting requirements on the computer architecture and the 
parallelization strategy - need to be applied to treat individual subsystems. In 
the case of our code, the system of equations that needs to be solved consists of 
the following components: 

• The multi-dimensional Euler equations of (relativistic) hydrodynamics, supplemented
by advection equations for the electron fraction and the chemical
composition of the fluid, and formulated in spherical polar coordinates;

• Equations for the space-time metric (or in the Newtonian case, the Poisson 
equation) for calculating the gravitational source terms in the Euler equations; 

• The Boltzmann transport equation and/or its moment equations which determine 
the (non-equilibrium) distribution function of the neutrinos; 

• The emission, absorption, and scattering rates of neutrinos, which are required 
for the solution of the neutrino transport equations; 

• The equation of state of the stellar fluid, which provides the closure relation 
between the variables entering the Euler equations, i.e. density, momentum, 
energy, electron fraction, composition, and pressure. 

In what follows we will briefly summarise the neutrino transport algorithms, 
thus focusing on the major computational kernel of our code. For a more complete 
description of the entire code we refer the reader to [19, 20], and the references 
therein. 
Fig. 1 Illustration of the phase space coordinates (see the main text)

2.3 “Ray-by-Ray Plus” Method for the Neutrino Transport Problem

The crucial quantity required to determine the source terms for the energy, momentum,
and electron fraction of the fluid owing to its interaction with the neutrinos is
the neutrino distribution function in phase space, $f(r, \vartheta, \phi, \epsilon, \Theta, \Phi, t)$. Equivalently,
the neutrino intensity $I = c/(2\pi\hbar c)^3 \cdot \epsilon^3 f$ may be used. Both are time-dependent
functions in a six-dimensional phase space, as they describe, at every point in space
$(r, \vartheta, \phi)$, the distribution of neutrinos propagating with energy $\epsilon$ into the direction
$(\Theta, \Phi)$ at time $t$ (Fig. 1).

The evolution of $I$ (or $f$) in time is governed by the Boltzmann equation, and
solving this equation is, in general, a six-dimensional problem (as time is usually not
counted as a separate dimension). A solution of this equation by direct discretization
(using an $S_N$ scheme) would require computational resources in the PetaFlop range.
Although there are attempts by at least one group in the United States to follow such 
an approach, we feel that, with the currently available computational resources, it is 
mandatory to reduce the dimensionality of the problem. 

Actually this should be possible, since the source terms entering the hydrodynamic
equations are integrals of $I$ over momentum space (i.e. over $\epsilon$, $\Theta$, and $\Phi$), and
thus only a fraction of the information contained in $I$ is truly required to compute
the neutrino effects on the dynamics of the flow. It therefore makes sense to consider
angular moments of $I$, and to solve evolution equations for these moments, instead
of dealing with the Boltzmann equation directly. The 0th to 3rd order moments are
defined as


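$$\{J, H, K, L, \ldots\}(r, \vartheta, \phi, \epsilon, t) = \frac{1}{4\pi} \int I(r, \vartheta, \phi, \epsilon, \Theta, \Phi, t)\, \mathbf{n}^{0,1,2,3,\ldots}\, \mathrm{d}\Omega \tag{1}$$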

where $\mathrm{d}\Omega = \sin\Theta\, \mathrm{d}\Theta\, \mathrm{d}\Phi$, $\mathbf{n} = (\cos\Theta, \sin\Theta\cos\Phi, \sin\Theta\sin\Phi)$, and exponentiation
represents repeated application of the dyadic product. Note that the moments
are tensors of the required rank.
So far no approximations have been made. In order to reduce the size of the
problem even further, one needs to resort to assumptions on its symmetry. At this
point, one assumes that $I$ is independent of $\Phi$ and then each of the angular
moments of $I$ becomes a scalar, which depends on three spatial dimensions, and
one dimension in momentum space: $J, H, K, L = J, H, K, L(r, \vartheta, \phi, \epsilon, t)$. Thus
the neutrino moment equations at different angular directions (except for some
terms which can be accounted for explicitly in an operator split) decouple from
each other. Therefore, for each “radial ray”, i.e. for all zones of the same angle, the
moment equations can be solved independently. Except for some additional terms,
this problem is identical to solving $N_\vartheta \times N_\phi$ times the moment equations for a
spherically symmetric star, with $N_\vartheta \times N_\phi$ being the number of grid zones in the angular
directions. As we will explain later, the great advantage of our “ray-by-ray” neutrino
transport is the easy way to obtain perfect scaling behaviour up to a large number of
cores.


The System of Equations 

With the aforementioned assumptions it can be shown [19] that in the Newtonian
approximation the following two transport equations need to be solved in order to 
compute the source terms for the energy and electron fraction of the fluid: 

These are evolution equations for the neutrino energy density, $J$, and the neutrino
flux, $H$, and follow from the zeroth and first moment equations of the comoving
frame (Boltzmann) transport equation in the Newtonian, $O(v/c)$ approximation.
The quantities $C^{(0)}$ and $C^{(1)}$ are source terms that result from the collision term
of the Boltzmann equation, while $\beta_r = v_r/c$, $\beta_\vartheta = v_\vartheta/c$, and $\beta_\varphi = v_\varphi/c$, where
$v_r$, $v_\vartheta$, and $v_\varphi$ are the components of the hydrodynamic velocity, and $c$ is the speed
of light. The functional dependencies $J = J(r, \vartheta, \varphi, \epsilon, t)$,
etc. are suppressed in the notation. This system includes four unknown moments
($J$, $H$, $K$, $L$) but only two equations, and thus needs to be supplemented by two
more relations. This is done by substituting $K = f_K \cdot J$ and $L = f_L \cdot J$, where $f_K$
and $f_L$ are the variable Eddington factors, which for the moment may be regarded
as being known, but in our case are in fact determined from a separate simplified
(“model”) Boltzmann equation.

The moment equations (2) and (3) are very similar to the $O(v/c)$ equations
in spherical symmetry which were solved in the 1D simulations of [21] (see
Eqs. (7), (8), (30), and (31) of the latter work). This similarity has allowed us to reuse
a good fraction of the one-dimensional version of the transport part for coding the
multi-dimensional algorithm. The additional terms necessary for this purpose have
been set in boldface above.

Finally, the changes of the energy, $e$, and electron fraction, $Y_e$, required for the
hydrodynamics are given by the following two equations


(for the momentum source terms due to neutrinos see [19]). Here $m_\mathrm{b}$ is the baryon
mass, and the sum in Eq. (4) runs over all neutrino types. The full system consisting
of Eqs. (2)-(5) is stiff, and thus requires an appropriate discretization scheme for its
stable solution.


2.3.1 Method of Solution 


In order to discretize Eqs. (2)-(5), the spatial domain $[0, r_\mathrm{max}] \times [\vartheta_\mathrm{min}, \vartheta_\mathrm{max}] \times
[\varphi_\mathrm{min}, \varphi_\mathrm{max}]$ is covered by $N_r$ radial, $N_\vartheta$ latitudinal, and $N_\varphi$ longitudinal zones,
where $\vartheta_\mathrm{min} = 0$ and $\vartheta_\mathrm{max} = \pi$ correspond to the north and south poles, respectively,
of the spherical grid, and $\varphi_\mathrm{min} = 0$ and $\varphi_\mathrm{max} = 2\pi$ cover the full sphere. (In general,
we allow for grids with different radial resolutions in the neutrino transport and
hydrodynamic parts of the code. The number of radial zones for the hydrodynamics
will be denoted by $N_r^\mathrm{hyd}$.) The number of bins used in energy space is $N_\epsilon$ and the
number of neutrino types taken into account is $N_\nu$.

The equations are solved in three operator-split steps corresponding to a lateral,
an azimuthal and a radial sweep.

In the first two steps, we treat the boldface terms in the respective first lines of
Eqs. (2) and (3), which describe the lateral and azimuthal advection of the neutrinos with
the stellar fluid, and thus couple the angular moments of the neutrino distribution of
neighbouring angular zones. For this purpose we consider the equations


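$$\frac{1}{c}\frac{\partial \Xi}{\partial t} + \frac{1}{r \sin\vartheta}\frac{\partial(\sin\vartheta\, \beta_\vartheta\, \Xi)}{\partial \vartheta} = 0 \qquad (6)$$

$$\frac{1}{c}\frac{\partial \Xi}{\partial t} + \frac{1}{r \sin\vartheta}\frac{\partial(\beta_\varphi\, \Xi)}{\partial \varphi} = 0 \qquad (7)$$


where $\Xi$ represents one of the moments $J$ or $H$. Although it has been suppressed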
in the above notation, an equation of this form has to be solved for each radius, for 
each energy bin, and for each type of neutrino. An explicit upwind scheme is used 
for this purpose. 
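
As an illustration of what an explicit upwind step for the lateral advection equation can look like, the following sketch advances one moment at a fixed radius, energy bin and neutrino type. It is a first-order finite-volume variant written under several assumptions that are not taken from the text: the velocity $v_\vartheta = c\,\beta_\vartheta$ is given at the cell interfaces, the fluxes through the first and last interface are set to zero, and the function and argument names are hypothetical.

```python
import numpy as np

def lateral_upwind_step(xi, v_theta_faces, theta_faces, r, dt):
    """Advance cell averages xi by one explicit first-order upwind step in theta.

    xi            : values in the N cells between consecutive theta_faces
    v_theta_faces : lateral velocity at the N+1 interfaces
    theta_faces   : interface angles (N+1 values, increasing)
    """
    # Upwind value at interior interfaces: take the cell the flow comes from.
    v_in = v_theta_faces[1:-1]
    xi_up = np.where(v_in >= 0.0, xi[:-1], xi[1:])

    # Flux sin(theta) * v * xi; left at zero on the first and last interface (assumption).
    flux = np.zeros_like(theta_faces)
    flux[1:-1] = np.sin(theta_faces[1:-1]) * v_in * xi_up

    # Cell "volume" in theta: integral of sin(theta) over the cell.
    dvol = np.cos(theta_faces[:-1]) - np.cos(theta_faces[1:])
    return xi - dt / r * (flux[1:] - flux[:-1]) / dvol
```

Because the update is written in flux form, the quantity $\sum_j \Xi_j\, \Delta(\cos\vartheta)_j$ is conserved to round-off, and positive data stay positive as long as the time step respects the usual CFL restriction.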

In the third step, the radial sweep is performed. Several points need to be noted 
here: 


• Terms in boldface not yet taken into account in the lateral sweep need to be
included in the discretization scheme of the radial sweep. This can be done in a
straightforward way since these remaining terms do not include derivatives of the
transport variables $J$ or $H$. They only depend on the hydrodynamic velocities $v_\vartheta$
and $v_\varphi$, which are constant scalar fields for the transport problem.

• The right hand sides (source terms) of the equations and the coupling in energy 
space have to be accounted for. The coupling in energy is non-local, since the 
source terms of Eqs. (2) and (3) stem from the Boltzmann equation, which is an 
integro-differential equation and couples all the energy bins. 

• The discretization scheme for the radial sweep is implicit in time. Explicit
schemes would require very small time steps to cope with the stiffness of the
source terms in the optically thick regime, and the small CFL time step dictated
by neutrino propagation with the speed of light in the optically thin regime. Still,
even with an implicit scheme more than $10^5$ time steps are required per simulation. This
makes the calculations expensive.
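
The stiffness argument in the last point can be made concrete with a toy problem. The sketch below is not the transport scheme itself: it assumes a single scalar relaxation equation $\mathrm{d}J/\mathrm{d}t = -\kappa\,(J - J_\mathrm{eq})$ with a large rate $\kappa$ as a stand-in for the source terms in the optically thick regime, and compares forward (explicit) and backward (implicit) Euler steps. All names and numbers are illustrative.

```python
def forward_euler(j, j_eq, rate, dt, steps):
    """Explicit update; stable only if rate * dt < 2."""
    for _ in range(steps):
        j = j - dt * rate * (j - j_eq)
    return j

def backward_euler(j, j_eq, rate, dt, steps):
    """Implicit update; the linear equation for the new value is solved in closed form."""
    for _ in range(steps):
        j = (j + dt * rate * j_eq) / (1.0 + dt * rate)
    return j

if __name__ == "__main__":
    # rate * dt = 1000: far beyond the explicit stability limit.
    print("explicit:", forward_euler(0.0, 1.0, 1.0e6, 1.0e-3, 10))
    print("implicit:", backward_euler(0.0, 1.0, 1.0e6, 1.0e-3, 10))
```

With a time step a thousand times larger than the relaxation time, the explicit iteration diverges while the implicit one settles on the equilibrium value.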

Once the equations for the radial sweep have been discretized in radius and energy,
the resulting solver is applied ray-by-ray for each pair of angles $(\vartheta, \varphi)$ and for each
type of neutrino; i.e. for constant $(\vartheta, \varphi)$, $N_\nu$ two-dimensional problems need to be
solved.

The discretization itself is done using a second order accurate scheme with 
backward differencing in time according to [21]. This leads to a non-linear system 
of algebraic equations, which is solved by Newton-Raphson iteration with explicit 
construction and inversion of the corresponding Jacobian matrix with the Block- 
Thomas algorithm. 
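
To indicate what a block-Thomas solve involves, here is a minimal sketch of block-tridiagonal elimination for a single linear system, as would arise in one Newton-Raphson iteration. It assumes, for illustration only, that the Jacobian is block tridiagonal with dense square blocks; the function name and the storage layout are hypothetical and not taken from the code described here.

```python
import numpy as np

def block_thomas(lower, diag, upper, rhs):
    """Solve a block-tridiagonal linear system.

    lower[i], diag[i], upper[i] are the (m, m) blocks of block row i
    (lower[0] and upper[-1] do not enter the solution); rhs[i] has shape (m,).
    """
    n = len(diag)
    c_prime = [None] * n
    d_prime = [None] * n

    # Forward elimination: remove the sub-diagonal blocks row by row.
    c_prime[0] = np.linalg.solve(diag[0], upper[0])
    d_prime[0] = np.linalg.solve(diag[0], rhs[0])
    for i in range(1, n):
        pivot = diag[i] - lower[i] @ c_prime[i - 1]
        c_prime[i] = np.linalg.solve(pivot, upper[i])
        d_prime[i] = np.linalg.solve(pivot, rhs[i] - lower[i] @ d_prime[i - 1])

    # Back substitution from the last block row upwards.
    x = [None] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] @ x[i + 1]
    return np.array(x)
```

The cost grows linearly with the number of block rows and cubically with the block size, which is why such a solver is attractive when only neighbouring zones are coupled.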


3 Porting and Scaling on the Cray XE6 “HERMIT” at HLRS 
3.1 Parallelization Strategy 

The ray-by-ray approximation readily lends itself to parallelization over the different 
angular zones. In order to make efficient use of modern supercomputer systems 
with relatively small shared-memory units (e.g. 16 CPUs per node on Cray XE6), 
distributed memory parallelism is indispensable. An MPI version of the VERTEX- 
PROMETHEUS code using domain decomposition was initially developed within a 
cooperation between MPA and the Teraflop Workbench at the HLRS in 2007/2008. 
Since then, the parallelization of VERTEX-PROMETHEUS has been further 
extended to allow good scaling on several thousands of cores as required for future 
3D supernova simulations. 

The VERTEX-PROMETHEUS code employs a hybrid MPI-OpenMP parallelization
scheme, in which the parallelization of the transport module - the main
computational kernel and most CPU-intensive part of the code - is along radial “rays”
for fixed angular bins of the three-dimensional grid. Hence, every “ray” of the
transport is treated by one core, using as many OpenMP threads as there are cores available
on an individual node. This strategy allows almost perfect scaling behavior, since
almost no MPI communication is necessary between individual rays during the
transport step.
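
The idea of distributing independent rays can be sketched as a simple index calculation. The following is a hypothetical contiguous block split of the angular rays over MPI ranks, written without any MPI calls; it is not the decomposition actually used in VERTEX-PROMETHEUS, and the function name is an illustrative choice.

```python
def rays_for_rank(n_theta, n_phi, n_ranks, rank):
    """Return the (i_theta, i_phi) rays owned by `rank` under a contiguous block split."""
    n_rays = n_theta * n_phi
    base, extra = divmod(n_rays, n_ranks)
    # The first `extra` ranks receive one additional ray each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return [divmod(ray, n_phi) for ray in range(start, stop)]

if __name__ == "__main__":
    for rank in range(5):
        print(rank, len(rays_for_rank(8, 16, 5, rank)))
```

Since each ray can be solved without data from other rays, the load per rank differs by at most one ray and no communication is needed inside the transport step of this sketch.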

The MPI parallelization of the much less expensive hydrodynamical part
PROMETHEUS is based on standard domain decomposition methods. Here, the
reconstruction scheme used to solve the hydrodynamic equations requires so-called
“ghost zones”, which have to be available in each MPI task. In our case, four ghost
zones are required on each cell interface in the angular directions to integrate one time
step, and these zones have to be communicated via MPI to the neighbouring MPI tasks.
The grid zones to be communicated are sketched in Fig. 2.

Fig. 2 Schematic sketch of the MPI communication pattern for an angular direction in the
hydrodynamics part of our code. The red rectangle symbolizes the data available in a specific
MPI task, the surrounding area marked by the dashed line reflects the zones to be communicated
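
The ghost-zone exchange can be emulated in a single process to show which data each task needs. The sketch below assumes a one-dimensional periodic direction (as for the azimuthal angle), at least four zones per task, and plain array copies in place of MPI messages; the helper name is hypothetical.

```python
import numpy as np

N_GHOST = 4  # ghost zones per interface, as stated in the text

def with_ghost_zones(field, n_tasks):
    """Split a 1D periodic array among n_tasks and attach ghost zones from the neighbours."""
    chunks = np.array_split(field, n_tasks)
    padded = []
    for k, own in enumerate(chunks):
        left = chunks[(k - 1) % n_tasks][-N_GHOST:]   # last zones of the left neighbour
        right = chunks[(k + 1) % n_tasks][:N_GHOST]   # first zones of the right neighbour
        padded.append(np.concatenate([left, own, right]))
    return padded

if __name__ == "__main__":
    for part in with_ghost_zones(np.arange(32), 4):
        print(part)
```

In an MPI code the two copies per task would become a send and a receive with each neighbour; only these thin boundary strips are exchanged per time step.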


3.2 Porting VERTEX-PROMETHEUS to the Cray XE6 “HERMIT”

As demonstrated in Fig. 3, we have already obtained excellent scaling behavior
with the parallelization strategy explained above. For example, we have performed scaling
tests on the BlueGene/P system JUGENE at the Forschungszentrum Jülich to
demonstrate that our VERTEX-PROMETHEUS code scales perfectly up to 65,000
cores.

Since our VERTEX-PROMETHEUS code runs successfully on several architectures,
the code should in principle work out of the box. However, we had to change
several minor statements in order to be able to compile the code. Furthermore,
while performing the first scaling test on the Cray at HLRS we detected that the
routine which calculates the most important neutrino interaction rates showed poor
performance. Initially, we used the same version of this part of the transport
solver that performs very well with the Intel compiler. To obtain better results on
the Cray XE6 we have rewritten this routine and now use a vectorized version
with one main loop.

Employing this single optimization, the code scales well on up to 32,000 cores 
of the Cray XE6 at HLRS as shown in Fig. 3. However, the scaling behavior is 
still slightly worse than on Intel platforms. We plan to analyse the detailed code 
performance on the Cray XE6 further to get better results of the scaling tests. 

Fig. 3 Strong scaling of VERTEX-PROMETHEUS on different machines and architectures. The 
colored lines show the speedup on the respective machine relative to the run with the smallest 
number of cores for a given problem size. The symbols mark the number of cores on which the 
timings were done. The dashed black lines indicate a theoretically perfect scaling behavior. Note 
that several lines lie on top of each other. Also note that due to limited memory on BlueGene/P 
three different setups were timed, which are shown as three separate speedup curves 
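
The speedup plotted in Fig. 3 is defined relative to the run with the smallest number of cores. A minimal sketch of that bookkeeping follows; the timings in the usage example are placeholders chosen for illustration and are not measurements from any of the machines discussed here.

```python
def strong_scaling(cores, wall_times):
    """Speedup and parallel efficiency relative to the run with the fewest cores."""
    ref_cores, ref_time = min(zip(cores, wall_times))
    result = []
    for n, t in zip(cores, wall_times):
        speedup = ref_time / t
        efficiency = speedup / (n / ref_cores)  # 1.0 corresponds to perfect scaling
        result.append((n, speedup, efficiency))
    return result

if __name__ == "__main__":
    # Placeholder timings (seconds per step), not measured data.
    for n, s, e in strong_scaling([1024, 2048, 4096], [100.0, 50.0, 30.0]):
        print(f"{n:6d} cores  speedup {s:5.2f}  efficiency {e:4.2f}")
```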


Another point concerning the special characteristics of the Cray XE6 is the strong
interconnect between the individual nodes. We do not profit much from this feature, since
our code requires little communication (less than 5 % of the total
computing time).

Furthermore, we want to improve the performance of I/O on the Cray XE6. The
I/O is now handled by means of parallel HDF5 to ensure high scalability and to
eliminate the excessive memory consumption associated with temporary I/O arrays
on the root node. The handling of I/O performs quite well on IBM BlueGene and
Intel systems; however, we want to optimize I/O on the Cray XE6 further.


4 Conclusion 

We have presented our main simulation tool VERTEX-PROMETHEUS. In the
past years, we have developed a fully MPI/OpenMP parallelized code version to
be able to perform large scale runs on several thousand cores. At the moment
our code shows excellent scaling behavior on several platforms. After the new
Cray XE6 “HERMIT” became available at HLRS, we ported VERTEX-PROMETHEUS
to this new system. With minor optimizations (required by the
compiler) the code now scales up to 32,000 cores.

Since our code is now ready to run on the new Cray XE6 at HLRS, we are
ready to start the first generation of three-dimensional simulations of core-collapse
supernova explosions this year. These simulations are so expensive (several
$10^{20}$ floating point operations) that we need to rely strongly on Tier-0 systems
such as “HERMIT”. Only systems like the new Cray XE6 in Stuttgart give us the
possibility to advance our understanding of the details of the explosion mechanism
of core-collapse supernovae.


Stability of the Strongest Magnetic Fields


Konstantinos D. Kokkotas, Burkhard Zink, and Paul Lasky 


Abstract Neutron stars with the strongest magnetic fields, known as magnetars, 
contain fields up to nine orders of magnitude stronger than those produced in the 
laboratory. While the field exterior to the star is thought to be dominated by a 
roughly dipolar structure, the interior field is entirely unknown, and is currently 
a hotly debated topic in astrophysics since it is thought to be connected with huge
gamma-ray outbursts, the giant flares, and possibly also with gravitational radiation.


1 Scientific Background 

No known object in the universe carries stronger magnetic fields than a magnetar,
a neutron star endowed with a surface magnetic field of approximately $10^{15}$ G in
strength [1]. Neutron stars in general represent the most extreme objects in the
Universe, with masses of about 1.4 times that of the Sun compressed into a region less than
30 km in diameter. The state of matter inside these stars is largely unknown, and
efforts to predict their internal structure with the use of gravitational-wave detectors 
are the subject of current research. 

Germany is very active in the field of gravitational-wave detection: Our group in
Tübingen forms part of one of the prestigious transregional SFBs (Sonderforschungsbereiche),
which is focused on “Gravitational Wave Astronomy” and has recently
entered its third period after a very positive review. We are part of these activities
in the form of two projects which contribute modeling efforts for neutron stars and
magnetars as possible gravitational wave sources.

Due to their ultra-strong magnetic fields, magnetars contain a substantial amount
of electromagnetic field energy, which they release in continuous radiation,
in frequent flares and storms, and sometimes in so-called “giant flares” with energies
greater than $10^{44}$ erg [2]. Moreover, these events are so energetic that they are
potential sources of gravitational radiation observable by Advanced LIGO, the
upcoming upgrade of the LIGO gravitational wave detectors which have already
been in operation for a few years [3]. In addition, the tails after the giant outburst
also exhibit a set of normal mode oscillations, some of which are likely core plasma
(Alfvén) waves excited by the outburst [4,5].

Gravitational wave luminosities and post-flare modes are likely connected to the 
internal structure of the magnetic field inside the magnetar. However, despite their 
importance, this internal structure is entirely unknown. Simple analytical models 
with purely dipolar or toroidal field structure are commonly used for modeling 
purposes, but such models are known to be unstable [6,9]. It is probable that actual 
core magnetic fields in neutron stars are much more complex. Since neutron stars 
are very compact objects, general relativistic magnetohydrodynamics (GRMHD) 
simulations are needed to investigate these strongest magnetic fields. 
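
The need for general relativity can be motivated with a one-line estimate of the compactness $2GM/(Rc^2)$, i.e. the ratio of the Schwarzschild radius to the stellar radius. The sketch below uses the mass and size quoted above (1.4 solar masses, radius 15 km from the 30 km diameter) and rounded CGS constants; the function name is an illustrative choice.

```python
G = 6.674e-8       # gravitational constant, cm^3 g^-1 s^-2
C = 2.998e10       # speed of light, cm/s
M_SUN = 1.989e33   # solar mass, g

def compactness(mass_msun, radius_km):
    """Return 2GM/(R c^2), the ratio of Schwarzschild radius to stellar radius."""
    return 2.0 * G * mass_msun * M_SUN / (radius_km * 1.0e5 * C**2)

if __name__ == "__main__":
    print("neutron star:", compactness(1.4, 15.0))
    print("Sun         :", compactness(1.0, 6.957e5))
```

For the neutron star this ratio is of order a few tenths, compared with a few parts in a million for the Sun, so Newtonian gravity is not an adequate approximation.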


2 Report 

The main goal of this project is to investigate the dynamical stability of magnetic 
fields inside highly magnetized neutron stars, and the associated gravitational wave 
signal possibly connected with giant flare events in magnetars. In our initial study 
of the subject [9] we have found that purely poloidal magnetic fields in general 
relativistic neutron star models are dynamically unstable to both low-order varicose 
(sausage-like) and kink instabilities, and we have followed this evolution over 
several hundred milliseconds as the instability is driven towards the saturation state. 
Following this process for over a hundred Alfvén times (and tens of thousands of
sound crossing times) is unprecedented in this field, and competing efforts to
do so had to stop evolving after only 30 or 40 ms. This new capability is in part
made possible by the use of GPU computing, which allowed us to exploit
massively parallel architectures for such long evolution times, as well as to
reduce experimentation turnaround times.

The main result obtained in the first paper is that well-known axisymmetric 
equilibrium fields, which are often used as exact background configurations for 
two-dimensional simulations or perturbative studies, are in fact always unstable, and 
therefore these simplified models are generally very approximate. This is consistent
with earlier studies on magnetic field stability in main-sequence stars [6]. We
do, however, observe a somewhat different saturation state than what would be 
expected from simple “twisted-torus” configurations which have been discussed in 
the literature. 

Such a twisted-torus state is a mixture of poloidal and toroidal fields, recognizing
that both purely toroidal and purely poloidal fields are expected to be dynamically
unstable. In between, Newtonian MHD studies have presumably found stable
configurations [7], so it is natural to expect that the reconfiguration
of magnetic fields starting from a dynamically unstable state may result in a
configuration similar to a twisted torus, in particular since the poloidal field instability
itself tends to produce toroidal components. In fact, we do find a saturation state
with substantial toroidal components, but only in parts of that state do we also observe
something akin to a twisted torus. Overall, the final state of our simulations is in
fact highly nonaxisymmetric and still quite dynamic, so either the system needs
much longer timescales to settle down, or we have found quasi-equilibria of a
truly complex structure, which could actually represent a more realistic
configuration for neutron star interiors in general. If confirmed in the future, this
would constitute an intriguing result.

If a giant flare in a magnetar is indeed connected with a large rearrangement
in the internal magnetic field, then this process could be an observable source of
gravitational radiation. Magnetic fields inside magnetars could exceed $10^{15}$ or even
$10^{16}$ G in amplitude, and a large-scale dynamical process would not only be a burst
source (though possibly a weak one), but also excite various modes inside the
star which could oscillate until some damping mechanism kicks in. The prospect
of obtaining a direct signal from the magnetic field inside a neutron star is
intriguing both from the perspective of understanding the giant flare mechanism better and
from the viewpoint of asteroseismology, since oscillation modes excited in this way
could give us insights into the properties of the supernuclear matter in the core.

However, the most important question is: will we be able to actually observe a 
signal from these events, even if a violent rearrangement of a strong magnetic field 
happens as indicated? This question has been hotly debated in the literature, and 
in a recent follow-up paper [10] we have given the first answer for a wide range 
of possible field strengths based on general relativistic magnetohydrodynamics 
simulations. 

In a series of simulations for the poloidal field instability, we have investigated
the gravitational wave output over hundreds of milliseconds. The field strengths
ranged from about $10^{15}$ G at the surface up to (unrealistically high) $10^{17}$ G. This
wide range was chosen to measure the scaling behavior of gravitational wave strain
and energy with the magnetic field strength, since even if the particulars of our
model are not represented in a magnetar giant flare, we expect such scaling laws
to be rather robust results. On the other hand, this allowed us to gain an order-of-magnitude
estimate of the kind of signal we could expect from a giant flare under
optimistic conditions. So far, estimates in the literature have spanned a range of many
orders of magnitude, so not even a rough estimate had been available until now.

The primary agent for gravitational-wave emission from neutron stars is the
fundamental quadrupole mode ($l = 2$ $f$-mode) since it couples efficiently into
gravitational radiation. If a large dynamical rearrangement of the magnetic field
happens inside a star, the resulting fluid motions can have an overlap with this
$f$-mode and thereby excite observable oscillations during the ringdown timescale
(approximately 100 ms). Such a monotone signal (the frequency hardly shifts due
to the presence of the magnetic field, dynamic or not) is an optimal target for
observation even under noisy conditions. However, current interferometric detectors
are not very sensitive in the frequency range of the mode (kHz), which somewhat
offsets its advantages. Still, it is the most obvious emission mechanism, and has
primarily been discussed in the literature.

We have found that the catastrophic rearrangement of magnetic fields following
a dynamical instability is in fact a rather weak motor for gravitational radiation,
certainly well below some of the more optimistic results in the literature. In fact,
assuming typical damping times, an $f$-mode excited from even a rather strong
internal field will not yield an observable signal in present or currently planned
detectors according to our simulations. Though disappointing, this is of course
a significant insight when discussing potential targets for gravitational wave
observations. Some hope comes from the fact that we have not covered the whole
space of dynamical instabilities yet, but it is hard to conceive why, say, a toroidal
unstable field of similar strength should produce a signal orders of magnitude
stronger.

But there is another result we have obtained in that paper, which is far more
promising in terms of observation: The magnetic field dynamics themselves of
course give rise both to a burst signal (during the reconfiguration) and possibly to
later local or semi-global modes of oscillation when the field has mostly settled
down. The Alfvén wave timescale gives rise to modes in the range of roughly
ten to hundreds of Hertz, which is precisely in the range of optimal sensitivity of
interferometric detectors, and incidentally also overlaps with the frequency range
spanned by the quasi-periodic oscillations (QPOs) seen in the tail of giant flares.
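
The quoted frequency range can be checked with a rough Newtonian estimate of the Alfvén crossing frequency, $f \sim v_A / (2R)$ with $v_A = B/\sqrt{4\pi\rho}$ in Gaussian units. The sketch below is only an order-of-magnitude illustration: the density of $10^{14}$ g cm$^{-3}$ and the radius of 10 km are assumed representative values, not parameters of the simulations, and relativistic corrections are ignored.

```python
import math

def alfven_frequency(b_gauss, rho_cgs, radius_cm):
    """Rough Alfven crossing frequency v_A / (2R) in Hz (Newtonian, Gaussian units)."""
    v_a = b_gauss / math.sqrt(4.0 * math.pi * rho_cgs)
    return v_a / (2.0 * radius_cm)

if __name__ == "__main__":
    for b in (1.0e15, 1.0e16):
        print(f"B = {b:.0e} G  ->  f ~ {alfven_frequency(b, 1.0e14, 1.0e6):.0f} Hz")
```

Under these assumptions, field strengths between $10^{15}$ and $10^{16}$ G map to frequencies between roughly ten and a few hundred Hertz.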

In large part due to the use of GPU computing, we can now afford to simulate
neutron stars for hundreds of milliseconds or even seconds. Therefore we were able,
for the first time in the published literature, to investigate the gravitational wave
output in this low-frequency range, which is so interesting for future observations.
We did indeed find a substantial amount of energy in this range, likely a mixture of
the (still ongoing) rearrangement due to the dynamical instability, and Alfvén modes
travelling along the field lines. While it is hard to disentangle these effects precisely,
we do observe a substantial gravitational wave output from these modes. It is still
well below what can be observed at present, but future gravitational wave detectors
like the proposed Einstein Telescope could be able to observe parts of the signal
under benign conditions. The unknown damping timescale of the Alfvén modes,
and the question of frequency stability of the modes, will have to be addressed in
the future to say more about this, but it is still an exciting result.

In extension of the work above, we have recently finished a more detailed
investigation of the nature of the large-scale magnetic field instability under different
conditions [11]. While the earlier studies were focussed on a “canonical” model
of a neutron star (a nonrotating polytrope with $K = 100$ and $\Gamma = 2$) which is
often employed in the literature, we have extended the parameter space to include
different equations of state and also rotation. There are a number of reasons for this:
first, it is important to estimate how robust results obtained in a special case are
under different (but still reasonable) conditions. Connected with that, we wanted to
estimate the parameter dependence of the dynamical instability, saturation state, and
of course also the level of gravitational wave emission.

The equation of state of a real neutron star is largely unknown. In fact, 
finding constraints on the EOS is one of the major goals of modern neutron star 
research, since this ties directly into nuclear physics and the properties of matter 
at supernuclear densities (as are present in the core of the neutron star). So far, we 
employ a popular, but simple and parametrized model, the polytrope, in the hope 
that the parameter space spanned by the models represents, to some extent, the full 
variety of behavior of real neutron stars, at least as far as the questions we are trying 
to address are concerned. In the most recent paper, we have varied the stiffness of the 
EOS (which also determines the maximum possible mass before collapse to a black 
hole would occur, and the mass-radius relation) between very low and very high 
values to observe how the gravitational wave output changes with this important 
parameter. 
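
For reference, the polytropic model $P = K\rho^\Gamma$ can be written down in a few lines. The sketch below assumes geometrized units ($G = c = M_\odot = 1$), in which the values $K = 100$ and $\Gamma = 2$ are commonly quoted; the unit convention and the sample density are assumptions made for illustration and are not stated in the text.

```python
def polytrope(rho, K=100.0, Gamma=2.0):
    """Pressure, specific internal energy and sound speed squared for P = K rho^Gamma."""
    pressure = K * rho**Gamma
    eps = pressure / ((Gamma - 1.0) * rho)       # specific internal energy
    enthalpy = 1.0 + eps + pressure / rho        # relativistic specific enthalpy
    cs2 = Gamma * pressure / (rho * enthalpy)    # sound speed squared, in units of c^2
    return pressure, eps, cs2

if __name__ == "__main__":
    # Sample rest-mass density in the assumed unit system.
    print(polytrope(1.28e-3))
```

A larger $\Gamma$ corresponds to a stiffer equation of state, which is the parameter varied in the study described above.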

The other parameter we have varied is the rotation of the neutron star, and
this deserves some explanation. Observed magnetars are slowly rotating (periods
of the order of 10 s, which is long compared to the sound crossing timescale of
milliseconds), so the initial focus on nonrotating models was a reasonable first step.
However, we wanted to understand the behavior of the instability when rotation is
introduced on grounds of principle, and we have therefore performed a number of
simulations of neutron star models with low to very rapid rotation rates (periods of
the order of milliseconds) to observe the dependence of the dynamical instability
on the rotation rate. It should be added that strongly magnetized neutron stars are
probably not born slowly rotating, so understanding the stability of magnetic fields
in their interior is not just an academic exercise.

The results of this extended study can be summarized as follows: (i) gravitational
waves from $f$-modes caused by magnetar flares are unlikely to be detected in
the current or near-future generation of gravitational wave observatories, (ii)
gravitational waves from Alfvén waves propagating inside the neutron star are more
likely candidates, although this interpretation relies on the unknown damping time
of these modes, (iii) any magnetic field equilibria derived from our simulations
are characterized as non-axisymmetric, with approximately 65 % of their magnetic
energy in the poloidal field, (iv) rotation acts to separate the timescales of different
instabilities in our system, with the varicose mode playing a more major role due to
a delayed kink instability and (v) despite the slowing growth rate of the kink mode,
it is always present in our simulations, even for models where the rotational period
is of the same order as the Alfvén timescale. Details can be found in [11].

We are now at the point where we have explored a good part of the essential
parameter space of the magnetic field instability. Future work will include
additional field topologies (toroidal fields) and also differential rotation. The latter is
particularly interesting since differential rotation can interact with magnetic fields
in interesting ways, e.g. via dynamo effects or the magnetorotational instability. We
are only now beginning to understand what magnetic fields in neutron stars look like,
and how they interact with the observed phenomena.

Table 1 Each simulation used 1 GPU. The simulations investigate magnetic field stability in
rotating magnetized neutron star models. Each neutron star is a rapid rotator with 400 Hz frequency,
and a central magnetic field strength of 4.8 x 10^16 G. Some simulations additionally have an
imposed toroidal field component

Simulation                   Time    SUs    Notes
4.8e16 G, 400 Hz             192 h   1536   Rotating magnetized simulation of purely poloidal field.
4.8e16 G, 400 Hz, tor 0.1    192 h   1536   Imposed toroidal field.
4.8e16 G, 400 Hz, tor 0.5    192 h   1536   Imposed toroidal field.
4.8e16 G, 400 Hz, tor 1.0    192 h   1536   Imposed toroidal field.


3 Codes 

All studies described above were performed with the Horizon code [8]. The code 
solves the equations of general relativistic magnetohydrodynamics (GRMHD) on 
general spacetimes and in three dimensions, and is therefore ideally suited to 
investigate the complex and often nonsymmetric dynamics of magnetic fields in 
neutron stars. 

Horizon is fully implemented in CUDA, and is therefore entirely ported to 
make use of NVIDIA GPUs. Since the code operates on a regular uniform mesh, 
many operations can be parallelized/vectorized in a straightforward manner, and 
coherency in the memory access patterns (which is important to obtain a high 
memory bandwidth on GPUs) is possible in many cases. While these statements 
would equally apply to Newtonian MHD, GRMHD has in addition a higher 
arithmetic complexity than standard MHD codes, so that we, overall, obtain a very 
high speedup compared to a reference CPU implementation. 

All major operations are performed on the GPU, and no memory transfers over
the PCI bus are needed to update the simulation as long as only one GPU is
employed. The code also has an experimental MPI implementation, but we have not
yet employed this part of the code in our studies, though this is planned for the future.


4 Resource Usage 

We have employed several machines to perform the simulations leading to the
publications mentioned above. Initially, the Nehalem cluster at HLRS was the only
larger resource available to us, but subsequently one of us (P. Lasky) obtained
access to a new, large machine in Australia, which also coincided with his move
from Germany to Melbourne. For practical reasons, many of the simulations were
performed on this so-called MASSIVE cluster, and only a small number on the
Nehalem cluster in Stuttgart. We have collected a list of the simulations performed
in Table 1.


Part II

Solid State Physics 

Prof. Dr. Holger Fehske 


The computational treatments in the fields of solid state physics and material science 
were directed towards the electronic and structural properties of fascinating systems 
ranging from metals via ceramics, oxides, semiconductors and their surfaces to 
nanostructures. The six contributions selected reveal that research in this area 
extraordinarily benefits from the supercomputing facilities of the High Performance 
Computing Center Stuttgart during this funding period. 

A good example is the work by the Stuttgart group (J. Roth et al.) on laser 
ablation in metals that has been carried out with fully parallelized IMD, a molecular 
dynamics simulation package. In this problem, laser radiation directly acts on 
the metal’s nearly-free electrons, which after excitation to a non-equilibrium 
state, quickly equilibrate. For this reason a two-temperature model with separate 
temperatures for electrons and phonons applies. The resulting set of coupled heat- 
conduction differential equations for electrons and ions gives the time evolution 
of the electron and lattice temperature within the system. On an atomistic scale 
the continuum description has to be replaced by molecular dynamics. Choosing 
aluminium as a reference system, the authors first determined the electronic material
properties (e.g. the heat capacity and the electron-phonon coupling strength) and
afterwards analyzed the melting depth and the ablation threshold, including the cases in
which a two-pulse sequence with long time separation or overlapping pulses is applied.

The ab initio study by B. Höffling and F. Bechstedt from the European Theoretical 
Spectroscopy Facility and the University of Jena deals with electronic surface 
properties of the transparent conducting oxides In₂O₃, SnO₂, and ZnO. Employing 
the repeated-slab supercell method, the authors have carried out density 
functional theory (DFT) calculations which include many-body effects perturbatively 
in order to determine ionization energies and electron affinities for different surface 
orientations and terminations of the oxides. To this end, the quasiparticle equation 
with the electronic self-energy is solved on top of the self-consistent solution of a 
generalized Kohn-Sham equation using Hedin's GW method. This allows one to analyze 
the surface stability as a function of the chemical potential of the oxygen atom 
or molecule. The numerical results indicate a strong dependence of the 
surface barrier on the orientation and termination of the surface, which might be 
caused by the strong surface dipoles existing in these ionic compounds. 

A technically related but physically different project addresses surface states at 
silicon surfaces with different hydrogen coverage, which is a central challenge on 
the way to optimizing solar cells. The technologically highly relevant investigations 
by the Paderborn group (U. Gerstmann et al.) were based on a gauge-including 
projector augmented plane wave approach in the framework of DFT with a 
gradient-corrected PBE functional. The focus is on electron paramagnetic resonance 
fingerprints of the paramagnetic states created by hydrogen adsorbed on Si(111) and 
Si(001). Here the numerics provide the microscopic structure and magnetization 
density of a paramagnetic hydrogen vacancy at differently oriented Si surfaces. 
Interestingly, the calculations could be extended to micro- and nano-crystalline 
3C-SiC, which are basic solar cell materials. 

The group from the Stuttgart MPI for Solid State Research (G. Bester and 
P. Hang) has continued its investigations of the vibrational properties of both 
passivated and unpassivated nanoclusters. The authors discuss how molecular-type 
surface-acoustical and surface-optical modes coexist and interact with bulk-type 
vibrations to the point where structural changes are induced by the surface. Their 
ab initio DFT calculations give deeper insight into the thermodynamics of III-V and 
II-VI colloidal nanoclusters with up to a thousand atoms. Of particular importance 
in this respect seems to be the behavior of the low-temperature specific heat. 
Notably, coherent acoustic modes could be identified in good agreement with 
experiment. 

To re-emphasize, all the projects in the rapidly growing field of computational 
solid state physics have in common, besides the high scientific quality, the strong 
need for computers with high performance to achieve their results. Therefore the 
new leading edge supercomputer technology being available at the HLRS will play 
an essential role in their physical research. 


Molecular Dynamics Simulations of Laser 
Ablation in Metals: Parameter Dependence, 
Extended Models and Double Pulses 


Johannes Roth, Johannes Karlin, Marc Sartison, Armin Krauß, 
and Hans-Rainer Trebin 


1 Introduction 

Laser ablation has become a very useful tool in machining today, for example 
for drilling holes, welding, engraving, or coating by deposition of laser-irradiated 
material. The opposite of deposition, the removal of material by a laser, is in general 
called laser ablation, and some aspects of this process are discussed here. 

If the targets of laser ablation are metals and the pulses applied have durations 
of a few femtoseconds, then the time evolution of the process can be described in 
the following way: The laser acts on the free electrons of the metal and excites 
them. The next step is a thermalization of the electrons, which leads to an electron 
temperature different from the ordinary lattice temperature. Then the electron 
system and the lattice start to exchange heat, while the electrons also conduct heat 
into the bulk. Up to this point the process is dominated by the 
quantum nature of the material. The lattice is heated up by the energy obtained from 
the electrons, it melts, and finally ablation occurs. The latter processes take 
place on the scale of several picoseconds and can thus be simulated by classical 
molecular dynamics simulations. 

Since we want to simulate large samples with millions to billions of atoms, we 
cannot use ab initio methods to study the quantum effects. Instead we apply 
a continuum model, the so-called two-temperature model (TTM) [1, 3, 7], which 
consists of two coupled heat balance equations formulated for the electrons and 
the lattice as functions of the temperatures mentioned above. The lattice equation 
will later be replaced by molecular dynamics (MD) simulations, which allows us 
to obtain atomistic information about the ablation process. The combined model is 
called TTM+MD. 

The behavior of a material in classical MD simulations is given by the structure 
(initial conditions) and the interaction. The question is now how this behavior is 
driven by the coupling to the electron system, whose quantum nature is described 
by the continuum model. In the case of metals there are three relevant parameters: 
electron heat conductivity, electron heat capacity and electron-phonon coupling. 
With respect to these parameters all metals can be divided into classes. Keeping 
the interaction and the crystal structure fixed, we have varied the parameters within 
the experimentally observed range. In the first part of this work we present the 
results of this study. 

In Fourier's law for the heat flow generated by a temperature gradient it is 
assumed that the temperature change is instantaneous. This might not be true in 
general for femtosecond pulses. Several authors have generalized Fourier's law 
to include a finite relaxation time [4, 5]. If combined with the conservation of 
energy, this leads to a generalization of the electron heat balance equation to a 
heat wave equation, also called the telegraph equation. The coupled equations 
are then called an extended TTM (ETTM) and, combined with MD, an ETTM+MD 
model. We show numerically by solution of the ETTM and by simulations with the 
ETTM+MD model that the simple TTM leads to satisfactory results for example 
for Al and Cu, but that the ETTM has to be applied for example in the case of Pb. 

Drilling holes with femtosecond lasers requires thousands of pulses with time 
intervals of hundreds of microseconds between two consecutive pulses. The complete 
process is obviously beyond the realm of classical molecular dynamics 
simulations. But since the sample cools down completely between two pulses, we 
can simply run two simulations, on pristine and on ablated samples, thus skipping the 
long interval in between. In the work presented here we have reduced the time 
interval even further, such that the sample does not cool down completely. Yet another 
aspect is the temporal shape of the pulses, which is usually Gaussian. However, 
experimental studies with increasing or decreasing overlapping pulse 
sequences suggest that non-Gaussian pulse shapes may be more effective in ablation. We 
have studied two cases of two overlapping Gaussian pulses, one increasing and one 
decreasing, and report first results. 


1.1 Molecular Dynamics Simulations 

All simulations have been carried out with IMD, the ITAP Molecular Dynamics 
simulation package [11,17] (available at http://www.itap.physik.uni-stuttgart.de/~imd). 
IMD contains modules to simulate laser ablation with the two-temperature model [9,21]. 
Most of the features are fully parallelized using MPI. For details of the usage and 
the parameters applied for ablation we refer the interested reader to the IMD homepage. 


2 The Two-Temperature Model and Its Coupling 
to Molecular Dynamics 


In femtosecond laser ablation the laser radiation acts directly on the free electrons 
of the metal, exciting them to a non-equilibrium state. If the electrons equilibrate 
fast enough, it is possible to define a separate temperature for them. This is the basis 
of the two-temperature model (TTM), in which separate temperatures for the electrons 
and the lattice (phonons) are introduced. The model was introduced by Anisimov 
et al. [1] based on ideas formulated by Ginzburg, Kaganov and others [3,7]. Laser 
ablation is then described by a system of coupled heat conduction equations for the 
electrons and the ions separately: 

$$C_e(T_e)\,\frac{\partial T_e}{\partial t} = \nabla\left[K_e \nabla T_e\right] - \kappa\,(T_e - T_i) + S(x,t), \tag{1}$$

$$C_i(T_i)\,\frac{\partial T_i}{\partial t} = \nabla\left[K_i \nabla T_i\right] + \kappa\,(T_e - T_i). \tag{2}$$

Equations (1) and (2) describe the time evolution of the electronic ($T_e$) and the ionic 
($T_i$) or lattice temperature within the metal. $C_{e,i}$ are the heat capacities, $K_{e,i}$ the 
heat conductivities, $\kappa$ the electron-phonon coupling constant and $S(x,t)$ the external 
laser field. With these equations the laser field is coupled into the system in a 
physically meaningful way: first the energy is brought into the electronic system via 
the source term $S(x,t)$. Then the electronic system transports the heat diffusively 
into the bulk and at the same time interacts with the ions. 
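
As an illustration of how Eqs. (1) and (2) can be integrated numerically, the following 
minimal Python sketch solves a one-dimensional version with an explicit finite-volume 
scheme. It is not the IMD implementation. The values of the heat capacity coefficient, 
the coupling constant and the electron heat conductivity are those quoted for aluminum 
in Sect. 3, and the electron heat capacity is taken to be linear in the temperature. 
The lattice heat capacity `c_i`, the inverse absorption depth `mu`, the absorbed 
fluence, the pulse timing, the grid, and the neglect of lattice heat conduction are 
assumptions chosen only for illustration. 

```python
import numpy as np


def ttm_1d(n_steps=10000, n_cells=60, dx=2e-9, dt=1e-16,
           gamma=135.0,     # J/(m^3 K^2), Al value quoted in Sect. 3
           kappa=2.45e17,   # J/(m^3 K s), Al value quoted in Sect. 3
           k_e=235.0,       # J/(s K m),   Al value quoted in Sect. 3
           c_i=2.4e6,       # J/(m^3 K), assumed constant lattice heat capacity
           absorbed=50.0,   # J/m^2, assumed absorbed fluence
           mu=1.25e8,       # 1/m, assumed inverse absorption depth
           sigma_t=1e-13, t0=4e-13, temp0=300.0):
    """Explicit 1D two-temperature model with zero-flux boundaries.

    Returns cell centres x, electron temperature T_e, lattice temperature T_i
    and the energy per area deposited by the source so far.
    """
    x = (np.arange(n_cells) + 0.5) * dx
    t_i = np.full(n_cells, temp0)
    # Evolve the electron energy density u_e = gamma * T_e^2 / 2, which is the
    # integral of C_e = gamma * T_e; this keeps the scheme energy conserving.
    u_e = np.full(n_cells, 0.5 * gamma * temp0 ** 2)
    deposited = 0.0
    for step in range(n_steps):
        t = step * dt
        t_e = np.sqrt(2.0 * u_e / gamma)
        # Normalized Gaussian pulse in time, exponential absorption in depth.
        g = np.exp(-0.5 * ((t - t0) / sigma_t) ** 2) / (np.sqrt(2.0 * np.pi) * sigma_t)
        source = absorbed * mu * np.exp(-mu * x) * g
        # Electron heat flux q = -K_e dT_e/dx on interior faces, zero at the ends.
        flux = np.zeros(n_cells + 1)
        flux[1:-1] = -k_e * np.diff(t_e) / dx
        exchange = kappa * (t_e - t_i)           # electron-phonon coupling term
        u_e = u_e + dt * (-np.diff(flux) / dx - exchange + source)
        t_i = t_i + dt * exchange / c_i          # lattice conduction neglected
        deposited += dt * dx * source.sum()
    t_e = np.sqrt(2.0 * u_e / gamma)
    return x, t_e, t_i, deposited
```

With the default (assumed) time step the explicit scheme is stable for the quoted 
aluminum parameters; a finer grid would require a correspondingly smaller time step. 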

To work on an atomistic scale, Eq. (2), which is a continuum description of the 
temperature field, has to be replaced by molecular dynamics. This means that instead 
of (2) the following equations of motion have to be solved together with (1): 


$$m_j \frac{d^2 x_j}{dt^2} = -\nabla_{x_j} U(\{x_k\}) - \frac{\kappa}{C_i}\,\frac{T_i - T_e}{T_i}\, m_j \frac{dx_j}{dt}. \tag{3}$$


This is the coupled TTM+MD model [6,14]. The lattice parameters $C_i$ and $K_i$ are 
no longer present, since they are intrinsic properties of the atoms. The coupling 
parameter $\kappa$ has to be translated to atomistic observables, which can be done 
by division by $C_i$, the atomistic heat capacity, which is obtained directly in the 
simulations. 
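
The role of the division by the atomistic heat capacity can be checked with a short 
Python sketch. It assumes, in the spirit of the TTM+MD approach [6,14], that the 
coupling enters the equation of motion as a velocity-proportional force 
`xi * m_j * v_j` with `xi = kappa * (T_e - T_i) / (C_i * T_i)`, and that the lattice 
heat capacity takes the Dulong-Petit value of three Boltzmann constants per atom. 
The function names and both assumptions are illustrative and are not taken from IMD. 

```python
import numpy as np

K_B = 1.380649e-23  # Boltzmann constant in J/K


def kinetic_temperature(velocities, mass):
    """Kinetic lattice temperature from velocities of shape (N, 3)."""
    n_atoms = velocities.shape[0]
    return mass * np.sum(velocities ** 2) / (3.0 * n_atoms * K_B)


def coupling_coefficient(kappa, c_i, t_e, t_i):
    """xi in the assumed force term xi * m * v; positive values heat the lattice."""
    return kappa * (t_e - t_i) / (c_i * t_i)


def lattice_heating_rate(velocities, mass, volume, xi):
    """Power per volume delivered by the force xi * m * v: sum(m v . xi v) / V."""
    return xi * mass * np.sum(velocities ** 2) / volume
```

Under these assumptions the power delivered to the atoms per unit volume equals 
the coupling constant times the temperature difference, i.e. the exchange term of 
Eq. (2), which is why no separate lattice heat capacity has to be supplied as an 
input parameter. 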

Thus we are left with the parameters $C_e$, $K_e$, and $\kappa$, which can vary in a broad 
range. The heat capacity $C_e$, which is a linear function of temperature for metals, is 
replaced by the heat capacity coefficient $\gamma = C_e/T_e$, which is constant over a broad 
temperature range. 


3 Parameter Studies of Electronic Material Properties 

While the heat capacity and heat conductivity of the ions are intrinsic properties 
of the interaction, the electron properties may vary in a broad range. 
The idea here is to keep the atomistic part, namely the crystal structure and the 
interaction, fixed, to vary the electronic parameters in the range observed in 
experiment, and to determine how strongly the melting depth and ablation threshold 
depend on these parameters. More details may be found in [13]. 

As the atomistic reference system we chose aluminum, first because it is well 
studied and has previously been used in our simulations [16, 18], and second because 
it turns out to be a good reference point centering the range of parameter values 
observed for other metals. 


Parameter Range and Sample Properties 

The parameters for standard Al are the electron heat capacity coefficient 
$\gamma = 135$ J/(m³K²), the electron-phonon coupling constant $\kappa = 2.45 \cdot 10^{17}$ J/(m³Ks), 
and the electron thermal conductivity $K_e = 235$ J/(sKm) (see for example Appendix 
C of the book of Bäuerle [2]). The laser fluences (footnote 2) for the study of the 
melting depth were varied between 457.77 and 951.44 J/m² in steps of 114.4 J/m². $K_e$ was 
increased from 67.8 to 406.8 J/(sKm) in steps of 67.8 J/(sKm), $\gamma$ in four steps from 
1.19 to 118.97 J/(m³K²), and $\kappa$ in four steps from $1.36 \cdot 10^{16}$ to $6.78 \cdot 10^{18}$ J/(m³Ks). 
The sample volume was 184.44 x 4.84 x 4.84 nm³, containing about 260,000 atoms. 


Results for the Melting Depth 

In all cases we keep two of the three parameters fixed at the aluminum value and 
vary the third one. If the melting depth is plotted as a function of fluence and 
heat capacity coefficient $\gamma$ (Fig. 1, top), we find only a weak dependence on this 
parameter. The reason is that the amount of energy stored in the electronic system is 
rather small. If the melting depth is plotted as a function of fluence and the thermal 
conductivity $K_e$ (Fig. 1, middle), we observe an almost linear increase with respect 
to this parameter. The reason is that the energy can penetrate deeper into the bulk 
for higher conductivities. If the melting depth is plotted as a function of fluence 
and the electron-phonon coupling constant $\kappa$ (Fig. 1, bottom), we observe a region 
between $10^{16}$ and $10^{17}$ J/(m³Ks) where no melting is induced. The coupling is too 
small and the energy stays in the electron system. At high $\kappa$ and for higher fluences 
the melting depth increases again with this parameter. The reason is similar to 
the previous case: a better coupling leads to a higher energy transfer, which finally 
melts the lattice. 

Footnote 2: The strength of a laser pulse is typically given by energy per area, called laser fluence. 

Fig. 1 Melting depth as a function of fluence and parameters. From top to bottom: heat capacity 
coefficient $\gamma$, thermal conductivity $K_e$, electron-phonon coupling parameter $\kappa$ 


Results for the Ablation Threshold 

The ablation threshold can suitably be fitted by simple functions: 

1. The dependence of the threshold fluence on the heat capacity coefficient can be 
described by a power law $F(\gamma) = a \cdot \gamma^b + c$ with a = 0.00706 m⁴K⁴/J ± 0.41 %, 
b = 1.98 ± 2.5 %, c = 1144 J/m² ± 0.044 %. Here, $\gamma$ is limited to the range 
between 1.19 and 118.97 J/(m³K²), since outside this region no ablation has been 
observed. The ablation threshold increases with increased energy storage, but the 
influence is again rather small. 

2. The threshold fluence is a linear function of the thermal conductivity over the 
whole range of parameters, $F(K_e) = a \cdot K_e + b$ with a = 0.284 Ks/m ± 3.6 % 
and b = 160 J/m² ± 1.7 %. For higher conductivities a larger fluence is required, 
since heat is lost by diffusion into the bulk. 

3. The threshold fluence with respect to the electron-phonon coupling has been 
fitted with a linear function, although here the dependence is not so obvious, as 
can be concluded from the error bars: $F(\kappa) = a \cdot \kappa + b$. The values are a = 
-1,351 mKs ± 28.9 %, b = 184.711 J/m² ± 0.93 %. For $\kappa > 6.78 \cdot 10^{18}$ J/(m³Ks) 
the coupling is so strong that no stable system can be simulated. For higher 
couplings a smaller fluence is sufficient, since heat is transferred faster to the 
lattice and leads to ablation. 
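
The first two fit functions can be evaluated directly. The sketch below is only an 
illustration of the arithmetic: the function names are hypothetical, the coefficients 
are the central values quoted in the list, and the fit for the electron-phonon 
coupling is left out because the magnitude of its printed slope is not clear from 
the text. 

```python
def threshold_vs_gamma(gamma, a=0.00706, b=1.98, c=1144.0):
    """Power-law fit F(gamma) = a * gamma**b + c, gamma in J/(m^3 K^2), F in J/m^2."""
    return a * gamma ** b + c


def threshold_vs_conductivity(k_e, a=0.284, b=160.0):
    """Linear fit F(K_e) = a * K_e + b, K_e in J/(s K m), F in J/m^2."""
    return a * k_e + b


# Values at the standard aluminum parameters (gamma = 135 lies slightly outside
# the fitted range, so the first number is an extrapolation).
print(round(threshold_vs_gamma(135.0), 1))
print(round(threshold_vs_conductivity(235.0), 2))
```

At the standard aluminum parameters the two fits give roughly 1261 J/m² and 227 J/m². 
These values lie close to the total fluence and to the energy density parameter 
reported for the single pulse in Sect. 5, which suggests, although the text does not 
state it, that the two fits refer to different fluence conventions. 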


Summary of the Parameter Studies 

In summary we conclude that the behavior of the ablation threshold is as expected: 
the electron system stores little energy; if the thermal conductivity is high, the heat 
flows into the bulk and leads to deep melting and an increased ablation threshold. A 
high electron-phonon coupling constant, on the other hand, leads to earlier melting 
and ablation, thus requiring less fluence. In general the situation is more complicated, 
since the parameters are temperature-dependent. The metals may be partitioned into 
three classes: Al, where the parameters $\gamma$ and $\kappa$ are merely rescaled; Ag, Cu, Au, with 
a high-temperature region of increased parameters; and Ni, Pt, W, with decreasing 
parameters [10]. A study like ours would then have to work with average or 
effective parameters. 


4 An Extended Two-Temperature Model 

Derivation of the Model 

In Fourier's law of heat flow, $q(x,t) = -K_e \nabla T(x,t)$, and in the standard 
two-temperature model [1,3,7] it is assumed that the temperature change is instantaneous 
and the relaxation time of the electrons $\tau_e$ is zero. If a finite relaxation time $\tau_e$ is 
introduced for the electrons [4,5], we get the modified law for the heat flow 

$$\tau_e\,\partial_t q(x,t) = -\left[q(x,t) + K_e \nabla T(x,t)\right], \tag{4}$$

which together with the conservation of energy leads to a heat wave equation: 

$$\tau_e C_e(T_e)\frac{\partial^2 T_e}{\partial t^2} + C_e(T_e)\frac{\partial T_e}{\partial t} = \nabla\left[K_e \nabla T_e\right] - \kappa\left(1 + \tau_e\frac{\partial}{\partial t}\right)(T_e - T_i) + \left(1 + \tau_e\frac{\partial}{\partial t}\right)S(x,t). \tag{5}$$

Comparing the new equation to the original equation (1), we observe that in addition 
to the second time derivative on the left side, which has frequently been added 
phenomenologically, the electron-phonon coupling parameter $\kappa$ and the laser source 
term $S(x,t)$ have to be supplemented by a differential operator. 
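
The step from Eq. (4) to Eq. (5) can be made explicit as follows. This is a sketch 
under assumptions that the text does not spell out: the electron energy balance is 
taken in the form of Eq. (1) with the heat flux written out, the temperature in 
Eq. (4) is identified with the electron temperature, and the time derivative of the 
electron heat capacity is neglected when the operator acts on the left-hand side. 

```latex
% Electron energy balance with explicit heat flux q (Eq. (1) before inserting Fourier's law):
C_e(T_e)\,\frac{\partial T_e}{\partial t}
  = -\nabla\cdot q - \kappa\,(T_e - T_i) + S(x,t)

% Apply the operator (1 + \tau_e \partial_t) to both sides:
\left(1 + \tau_e\frac{\partial}{\partial t}\right) C_e(T_e)\,\frac{\partial T_e}{\partial t}
  = -\nabla\cdot\left(q + \tau_e\,\partial_t q\right)
    - \kappa\left(1 + \tau_e\frac{\partial}{\partial t}\right)(T_e - T_i)
    + \left(1 + \tau_e\frac{\partial}{\partial t}\right) S(x,t)

% Eq. (4) gives q + \tau_e \partial_t q = -K_e \nabla T_e, hence
-\nabla\cdot\left(q + \tau_e\,\partial_t q\right) = \nabla\left[K_e \nabla T_e\right]

% Neglecting \tau_e (\partial_t C_e)(\partial_t T_e) on the left-hand side yields Eq. (5):
\tau_e C_e(T_e)\,\frac{\partial^2 T_e}{\partial t^2} + C_e(T_e)\,\frac{\partial T_e}{\partial t}
  = \nabla\left[K_e \nabla T_e\right]
    - \kappa\left(1 + \tau_e\frac{\partial}{\partial t}\right)(T_e - T_i)
    + \left(1 + \tau_e\frac{\partial}{\partial t}\right) S(x,t)
```

In the limit of vanishing relaxation time the operator reduces to the identity and 
Eq. (1) is recovered. 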


Results for ETTM 

The extended two-temperature model (ETTM) has been solved numerically by 
Hüttner and Rohr [4] and reproduced by us for 1 ps pulses. For Al ($\tau_e = 0.067$) 
the differences between the TTM and the ETTM are negligible. They are significant 
for Cu ($\tau_e = 0.276$), where a difference in peak temperature of 1,000 K and a final 
temperature difference of 500 K are observed. They grow even larger for Pb ($\tau_e = 
0.377$), with a peak temperature difference of 6,000 K and a final temperature 
difference of 1,000 K. 


Results for ETTM+MD 

We studied the ETTM+MD model for a test sample with 180,000 atoms and a 
number of different metals. Standard embedded atom interactions were applied [20]. 
The electron and lattice temperatures were monitored as a function of time at a 
certain depth within the sample. In all cases we find that the lattice temperature 
is completely unaffected by a finite relaxation time. The behavior of the electron 
temperature is presented in Fig. 2 for 0.1 ps pulses. For Al and even for Cu there 
are only minor differences between the two models, although the relaxation time 
increases by a factor of 4. For Pb there are deviations between TTM+MD and 
ETTM+MD, but they are considerably reduced compared to the TTM vs. ETTM case, 
i.e. at the peak temperature from 6,000 to 3,000 K and at the final temperature from 
1,700 K to a very low value. 

Fig. 2 Electron temperatures in the TTM+MD and the ETTM+MD model. From left to right: 
Al, Cu and Pb. The results for TTM+MD are printed in blue and for ETTM+MD in red 


Summary for the ETTM 

The results for ETTM and ETTM+MD are not directly comparable due to 
the different parameters used. In general the standard two-temperature model is 
certainly sufficient for long pulse durations, i.e. longer than the electron relaxation 
time $\tau_e$. The extended two-temperature models are required if $\tau_e$ is large, as in the 
case of Pb. The heat transport is delayed in ETTM and ETTM+MD, which leads to 
a higher electron temperature, since the heat flow to the lattice ions is reduced. The 
wave nature of the extended equation (5) does not play a role in the regime studied. 


5 Properties of Two-Pulse Sequences in Aluminum 

A two-pulse sequence with a long time separation and two overlapping pulses 
formed by two increasing and decreasing Gaussians have been simulated to study 
the effect of pulse shapes on the effectiveness of ablation. More details can be found 
in [8]. 


Pulse Shapes 


The illumination of the samples with the laser beam was chosen to be homogeneous 
because of their small cross sections. Thus the intensity distribution is given by a 
one-dimensional function 

$$S(x,t) = (1 - R)\,\mu \exp(-\mu x)\,\sigma_E\, g(t). \tag{6}$$

The reflectivity $R$ is set to 0.825, $\mu$ is the inverse absorption depth and $g(t)$ the 
temporal shape of the laser pulse. The energy density per area parameter $\sigma_E$ for 
aluminum is 144.18 J/m². 

The starting point was a Gaussian pulse with width $\sigma_t = 0.4$ ps: 

$$g_0(t) = \frac{1}{\sqrt{2\pi}\,\sigma_t} \exp\left(-\frac{1}{2}\,\frac{(t - t_0)^2}{\sigma_t^2}\right). \tag{7}$$

For all simulations $t_0$ was set to 1.018. 

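The shapes of the double pulses can be described with the following formula 

$$g_i(t) = \frac{1}{\sqrt{2\pi}\,\sigma_t}\left[a_i \exp\left(-\frac{1}{2}\,\frac{(t - t_0)^2}{\sigma_t^2}\right) + b_i \exp\left(-\frac{1}{2}\,\frac{(t - t_0 - c_i)^2}{\sigma_t^2}\right)\right] \tag{8}$$

with amplitudes $a_i$ and $b_i$ and time interval $c_i$. 

For the separated double pulse we set $a_1 = b_1 = 1$ and $c_1 = 15$. For the increasing 
pulse we set $a_2 = 3/4$, $b_2 = 1$ and $c_2 = 2$. For the decreasing pulse we set $a_3 = 1$, 
$b_3 = 3/4$ and $c_3 = 2$. 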
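
The pulse shapes can be written down in a few lines of Python. The sketch assumes 
that each Gaussian is normalized to unit area, that the pulse centre and the time 
interval are given in the same unit as the pulse width (ps), and that the total 
incident fluence is the energy density parameter times the sum of the two amplitudes, 
divided by one minus the reflectivity. These are assumptions, not statements from 
the text, and the function names are hypothetical. 

```python
import numpy as np


def g0(t, t0=1.018, sigma_t=0.4):
    """Unit-area Gaussian pulse centred at t0 with width sigma_t (assumed ps)."""
    return np.exp(-0.5 * ((t - t0) / sigma_t) ** 2) / (np.sqrt(2.0 * np.pi) * sigma_t)


def g_double(t, a, b, c, t0=1.018, sigma_t=0.4):
    """Two Gaussians with amplitudes a and b, the second one delayed by c."""
    return a * g0(t, t0, sigma_t) + b * g0(t, t0 + c, sigma_t)


def incident_fluence(sigma_e, a, b, reflectivity=0.825):
    """Assumed relation between the energy density parameter and the total fluence."""
    return sigma_e * (a + b) / (1.0 - reflectivity)


# (a, b, c) for the three double-pulse shapes described in the text.
PULSES = {
    "separated": (1.0, 1.0, 15.0),
    "increasing": (0.75, 1.0, 2.0),
    "decreasing": (1.0, 0.75, 2.0),
}
```

Under these assumptions, the separated double pulse with an energy density parameter 
of 144.18 J/m² corresponds to an incident fluence of about 1,648 J/m², which matches 
the value quoted for two separate pulses in the results; for the other pulse shapes 
the same relation comes within about 2 % of the quoted totals. 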


Results of the Simulation of Two-Pulse Sequences 

The sample size for the simulations was 241.92 x 4.84 x 4.84 nm³, containing about 
350,000 atoms. The samples were equilibrated at 305 K. Sequences of simulations 
were carried out at intervals of $\Delta\sigma_E = 8$ J/m². Thus the data given for the ablation 
thresholds are only accurate to within about $\Delta\sigma_E/2$. 

1. Single pulse: Ablation sets in at $\sigma_E = 224.28$ J/m² and t = 12.6 ps with an 
ablation depth of 21.2 nm. The sample melts down to a depth of 35.0 nm. 

2. Two pulses at a fixed time interval: Ablation is observed at $\sigma_E = 144.18$ J/m² 
and t = 29.8 ps, thus after the second pulse has arrived. The first pulse has melted 
the sample up to a depth of 26.0 nm at the time when the second pulse hits the sample. 
At first the ablation depth is 10.4 nm, but finally the sample melts down to a depth 
of 79.6 nm and a layer of 26.0 nm is ablated. 

The electron and lattice temperatures have nearly reached equilibrium when 
the second pulse hits the sample, but the sample is still warm. 

3. Increasing pulse: Ablation sets in at $\sigma_E = 144.18$ J/m² and t = 11.9 ps with an 
ablation depth of 11.4 nm and a melt depth of 34.5 nm. The sample melts up to a 
depth of 54.0 nm and the total depth of the ablated layer is 20.0 nm. 

4. Decreasing pulse: Ablation is observed at $\sigma_E = 137.17$ J/m² and t = 12.2 ps 
with an ablation depth of 12.2 nm and a melt depth of 34.2 nm. The sample melts 
up to a depth of 51.0 nm and the total depth of the ablated layer is 20.3 nm. 

At first the rise of the electron temperature is sharp, as for a single pulse, 
but then it continues to rise slowly. This behavior is caused by the onset of the 
second pulse at that time. The lattice temperature behaves similarly to the previous 
case. Its rise time is longer than for a single pulse. Obviously, the specific pulse 
shape of the laser beam is smeared out and only its width determines the behavior 
of the lattice temperature. 

The total fluence to achieve ablation is 1,260 J/m 2 for a single pulse, 1,648 J/m 2 
for two separate pulses, 1,439 J/m 2 for the increasing and 1,367 J/m 2 for the 
decreasing pulse. 


Summary of the Simulation of Two-Pulse Sequences 

Two separate pulses require considerably more fluence than a single pulse to achieve 
ablation. Furthermore, ablation is observed only once. There are several reasons for 
this behavior: part of the energy of the second pulse is still absorbed by the ablated 
layer and lost for further heating. The fact that the sample is still warm should play a 
minor role. Another reason for the higher threshold is that the energies of the pulses 
add up, but ablation is initiated by one pulse only. The total number of ablated layers 
is similar to that for the single pulse. The final melting depth is much larger than the 
depth for a single pulse. Obviously the energy added to the system acts as expected. 
To supplement this study the distance between the pulses was varied. If the height 
of the second pulse is between 75 and 87.5 % of the first pulse, we find that $\sigma_E$ has 
to rise up to about 152.19 J/m² to achieve ablation. 

The two cases of overlapping pulses are very similar. The fluence for ablation 
is about 10% higher than for the single pulse. This is due to the lower maximum 
and shows that it is better to concentrate the laser energy into a very short pulse. 
The melting depths are larger than for a single pulse which again reflects the fact 
that more energy has been added to the system. In contrast to a single sharp pulse 
ablation splits up into several layers, but in total the depth is similar to the case of a 
single pulse. 

The results should be compared to the work of Rosandi and Urbassek [12]. The 
pulses they study are half as wide as ours, and they vary the time between the two 
pulses. Unfortunately they do not report melting depths and ablation thresholds. 

Experimental results show that the most effective way to ablate material by 
multi-pulses is a decreasing sequence of pulses with very short distance between them, 
almost overlapping. For increasing sequences the small peaks at the beginning do 
not lead to melting or ablation, and thus this pulse shape is not as effective as 
decreasing pulses. The most effective pulse shape, however, seems to be a sharply 
rising pulse with a slow decay [19]. 


Table 1 Pair potentials 

Year  Machine       Time (µs/step/atom)  Cores    Atoms/core 
1996  T3E 600       53                   8-512    4,630-37,044 
1999  T3E 1200      38                   128-512  10,117,414 
2006  Blue Gene/L   20 ± 1               1-2,048  2,000-128,000 
2010  XT5 (Jaguar)  6                    131,072  1,048,576 
2012  XE6           6.5                  512      16,384-1,048,576 
2012  XK6 Cuda      4.7                  16       16,384-1,048,576 
2012  XE6 OMP       6.0                  16-512   16,384-1,048,576 


Table 2 EAM interactions 

Year  Machine       Time (µs/step/atom)  Cores    Atoms/core 
2006  Blue Gene/L   36 ± 5               1-2,048  2,000-128,000 
2010  Nehalem       8.3                  2,048    29,297 
2010  Nehalem       9.7                  3,072    19,531 
2010  Nehalem       9.9                  3,584    16,741 

Ranges indicate that the performance is almost the same within this interval. 


6 Performance and Benchmarks 

General benchmark data for IMD have been given by Stadler et al. [17]. The data 
demonstrate that IMD scales almost linearly in weak scaling (same number of atoms 
per processor) and fairly well in strong scaling (total number of atoms constant, 
thus communication load growing). This behavior is still valid, as a systematic study 
on the Blue Gene/L clearly shows (see the previous HLRS report [9]). 

Tables 1 and 2 collect benchmark results for T3E, Blue Gene/L (Jülich), 
Nehalem, Hermit (XE6/XK6), and Jaguar (XT5). EAM interactions typically require 
up to twice as much time as pair interactions. Currently we find that 
IMD is about three times faster on Hermit than on the Blue Gene/L. On Jaguar, 
the degradation of performance going from 1 to 131,072 cores was only about 
11 % for 1,048,576 atoms per core. The total number of atoms in this case was 
137,438,953,472. IMD is about 8 % faster with the GNU compiler than with 
the Cray compiler. If the extra time required for EAM interactions with respect to 
pair potentials is taken into account, we find that IMD is currently about as fast on 
the Nehalem as on Hermit. 
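
The numbers in the tables can be turned into total system sizes and rough wall-clock 
times per step. The sketch below assumes that the tabulated time is the time per step 
and per atom on one core, so that the wall-clock time of a step is this time multiplied 
by the number of atoms per core; this reading of the column header and the function 
names are assumptions. 

```python
def total_atoms(cores, atoms_per_core):
    """Total number of atoms in a weak-scaling run."""
    return cores * atoms_per_core


def wall_time_per_step(time_us_per_step_atom, atoms_per_core):
    """Estimated wall-clock seconds per MD step, assuming the tabulated time
    is measured per atom on a single core."""
    return time_us_per_step_atom * 1e-6 * atoms_per_core


# Jaguar (XT5) entry of Table 1: 131,072 cores with 1,048,576 atoms per core.
print(total_atoms(131072, 1048576))      # total system size
print(wall_time_per_step(6.0, 1048576))  # seconds per step under the stated assumption
```

The first number reproduces the total of 137,438,953,472 atoms quoted above. 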

The numbers for pair potentials may be not as reliable as those for EAM 
interactions since the former are pure benchmark results from short time runs while 
the latter are results from true production simulations from multi-hour runs. 

In all cases tested up to now, OpenMP led to a rapid loss of performance, for 
example by a factor of 10 when going from one to two threads on Hermit (pure 
OpenMP). Cray is currently carrying out benchmarks to figure out how to improve 
the performance with OpenMP. If OpenMP and MPI are combined, the performance loss 
is down to 44 % for 16,384 atoms and 11 % for 1,048,576 atoms for 8 OpenMP 
threads. 

The GPU version of IMD is still under development. Currently it can be used 
only for pair potentials and monatomic samples. 

A major issue is the development of inhomogeneities during the simulation. The 
performance loss for 60 million atoms was about 75 % after 200,000 time steps. For 
movies the full sample has to be run until the end. For production runs a simple cure 
is available: since the simulation has to be restarted anyway, the uninteresting bulk 
of the sample is cut off after 200,000 steps. 

In general, load balancing is a very hard task, because the communication setup 
is so central to the code that changing it to improve performance would be almost 
equivalent to rewriting the code. Furthermore, the performance is strongly dependent 
on the simulated setup, so there is no general recipe. Since the performance 
problem is typically one-dimensional, it has been discussed to break up the rigid 
assignment of atoms to equidistant chunks of the simulation box. However, if a 
reordering of the atoms occurs during the simulation, this could kill all improvements, 
since it requires communication of huge numbers of atoms and vast amounts of extra 
memory. 
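
The idea of replacing equidistant chunks by chunks with equal numbers of atoms can be 
illustrated in one dimension. The following sketch is not part of IMD; it only compares 
the atom counts per domain for equidistant boundaries and for boundaries placed at 
quantiles of the atom positions along the beam direction. All function names are 
hypothetical, and the cost of moving atoms between processes, which the text identifies 
as the main obstacle, is ignored. 

```python
import numpy as np


def equidistant_edges(x_min, x_max, n_domains):
    """Domain boundaries that split the box into slabs of equal width."""
    return np.linspace(x_min, x_max, n_domains + 1)


def balanced_edges(x, n_domains):
    """Domain boundaries at quantiles of the positions: equal atom counts per slab."""
    return np.quantile(x, np.linspace(0.0, 1.0, n_domains + 1))


def atoms_per_domain(x, edges):
    """Number of atoms falling into each slab."""
    counts, _ = np.histogram(x, bins=edges)
    return counts


def imbalance(counts):
    """Load of the busiest domain relative to the average load (1.0 is ideal)."""
    return counts.max() / counts.mean()
```

For a sample consisting of a dense bulk and a dilute ablation plume, the equidistant 
decomposition leaves most processes nearly idle, whereas the quantile-based boundaries 
give an imbalance close to one. 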


7 Summary 

We presented results on ablation simulations with homogeneous laser beams in the 
two-temperature model combined with molecular dynamics simulations. We studied 
the influence of the electronic parameters, namely heat conductivity, heat capacity and 
electron-phonon coupling, and the question of in which cases the TTM must be extended 
to finite electron temperature relaxation times. Finally we presented results on pulse 
sequences and overlapping pulses. 


References 


1. Anisimov S.I., Kapeliovich B.L., Perel'man T.L., Electron-emission from surface of metals 
induced by ultrashort laser pulses, Zh. Eksp. Teor. Fiz. 66, 776 (1974) [Sov. Phys. JETP 
39, 375-380 (1974)]. 

2. Bäuerle D., Laser Processing and Chemistry, Fourth Edition, Springer Heidelberg 2011. 

3. Ginzburg V.L., Shabanskiy V.R., Kineticheskaya temperatura elektronov v metallakh i 
anomalnaya elektronnaya emissiya, Dokl. Akad. Nauk SSSR 100, 445-448 (1955). 

4. Hüttner B., Rohr G., On the theory of ps and sub-ps laser pulse interaction with metals I. 
Surface temperature, Appl. Surf. Sci. 103, 269-274 (1996). 

5. Hüttner B., Thermodynamics and transport properties in the transient regime, J. Phys.: Cond. 
Matt. 11, 6757-6777 (1999). 

6. Ivanov D.S., Zhigilei L.V., Combined atomistic-continuum modeling of short-pulse laser 
melting and disintegration of metal films, Phys. Rev. B 68, 064114 (2003). 




7. Kaganov M.I., Lifshits I.M., Tanatarov L.V., Relaxation between electrons and the crystalline 
lattice, Zh. Eksp. Teor. Fiz. 31, 232 (1956) [Sov. Phys. JETP 4, 173 (1957)]. 

8. Krauß A., Multipulsanregung in Metallen, Bachelor Thesis, Stuttgart (2011). 

9. Roth J., Trichet C., Trebin H.-R., Sonntag S., Laser ablation of metals, in High Performance 
Computing in Science and Engineering '10, eds. W.E. Nagel, D.B. Kröner, M.M. Resch, 
Springer Heidelberg, 2011, pp. 159-168. 

10. Lin Z., Zhigilei L.V., Celli V., Phys. Rev. B 77, 075133 (2008). 

11. Roth J., Gähler F., Trebin H.-R., A molecular dynamics run with 5.180.116.000 particles, Int. 
J. Mod. Phys. C 11, 317-322 (2000). 

12. Rosandi Y., Urbassek H.M., Ultrashort-pulse laser irradiation of metal films: the effect of a 
double-peak laser pulse, Appl. Phys. A 101, 509-515 (2010). 

13. Sartison M., Characterization of Ablation Properties, Bachelor Thesis, Stuttgart (2011). 

14. Schäfer C., Urbassek H.M., Zhigilei L.V., Metal ablation by picosecond laser pulses: A hybrid 
simulation, Phys. Rev. B 66, 115404 (2002). 

15. Sonntag S., Computer Simulations of Laser Ablation from Simple Metals to Complex Metallic 
Alloys, PhD Thesis, Stuttgart (2011). 

16. Sonntag S., Trichet C., Roth J., Trebin H.-R., Molecular Dynamics Simulations of Cluster 
Distribution from Femtosecond Laser Ablation in Aluminum, Appl. Phys. A 104, 559-565 
(2011). 

17. Stadler J., Mikulla R., Trebin H.-R., IMD: A software package for molecular dynamics 
studies on parallel computers, Int. J. Mod. Phys. C 8, 1131-1140 (1997). 

18. Sonntag S., Roth J., Gähler F., Trebin H.-R., Femtosecond laser ablation of aluminum, Appl. 
Surf. Sci. 255, 9742-9744 (2009). 

19. Siegel J., Deep ablation in dielectrics with temporally shaped femtosecond pulses, Talk at the 
11th Conference on Laser Ablation, Cancun, Mexico (2011). 

20. Sheng H., https://sites.google.com/a/gmu.edu/eam-potential-database/Pb, http://cds.gmu.edu/node/39. 

21. Ulrich C., Simulation der Laserablation an Metallen, Diplomarbeit, Stuttgart (2007). 
http://elib.uni-stuttgart.de/opus/volltexte/2007/3296/ 


Electronic Surface Properties of Transparent 
Conducting Oxides: An Ab Initio Study 


B. Höffling and F. Bechstedt


Abstract We investigate the surface properties of the transparent conducting oxides
In2O3, SnO2, and ZnO using density functional theory and quasiparticle calculations
based on many-body perturbation theory. We employ the repeated-slab supercell
method. An energy alignment of valence and conduction states via the electrostatic
potential is applied to determine ionization energies and electron affinities for
various surface orientations and terminations of the oxides. In addition, surface
energies for different orientations of bixbyite In2O3 are calculated. We find a strong
influence of surface orientation and preparation techniques on these fundamental
quantities.


1 Introduction 

Transparent conducting oxides (TCOs) like In2O3, SnO2, and ZnO are routinely
used as transparent electrodes in photovoltaic and optoelectronic devices [1] as well
as in transparent electronics based on doped oxides [2,3]. They are transparent in
almost the entire range of the solar spectrum and usually exhibit a high electron
conductivity [4-6]. They are also used in silicon (Si) photonics and Si-based solar
cells [7]. Electronic properties of their surfaces like ionization energy and electron
affinity are frequently used to predict natural band discontinuities at the interfaces
with other materials such as Si [8,9]. The existence of surface or interface states
within the fundamental gap can lead to electron-hole recombination and limit
the efficiency of the device. Consequently, these parameters are of great interest
but, owing to sample preparation problems, they are rather poorly known. Modern
theoretical approaches can help to address these questions.

Density functional theory (DFT) is known to underestimate the fundamental
band gap of semiconductors; therefore, many-body effects have to be taken into
account correctly to describe the electronic properties of oxides [10-13]. We use
modern quasiparticle (QP) calculations based on many-body perturbation theory
[12,14] to predict the electronic bulk properties of the body-centered cubic (bcc)
bixbyite as well as the rhombohedral (rh) geometry of In2O3, the most favored
rutile (rt) geometry of SnO2, and wurtzite (wz) structure ZnO. We combine the
results with DFT ground-state calculations to obtain surface energies, ionization
potentials and electron affinities for various surface orientations and terminations
of the TCOs. Due to the large cell size required for surface calculations and the
high computational cost of quasiparticle methods, massively parallel machines are
required to perform the calculations.

The underlying theoretical and computational methods are described in Sect. 2. 
In Sect. 3 our results are presented and their reliability discussed in the light 
of available measured values. Finally, in Sect. 4, we give a brief summary and 
conclusions. 


2 Computational Methods 

The ground-state properties of the oxides are determined in the framework of
DFT [15], using the local density approximation (LDA) [16] for exchange and
correlation (XC). We employ the XC functional of Ceperley and Alder [17]. The
calculations for ZnO have been carried out in the generalized gradient approximation
(GGA), employing the PW91 functional to model XC [18]. All computations
are performed with the Vienna Ab initio Simulation Package (VASP) [19]. The
projector-augmented wave (PAW) method [20] is used to describe the electron-ion
interaction in the core region. Usually it allows for the accurate treatment of first-row
elements such as oxygen and localized semicore states such as In 4d, Zn 3d, and
Sn 4d by modest plane-wave cutoffs. The electronic wave functions are expanded into
plane waves up to cutoff energies of 550 (In2O3), 450 (SnO2), and 500 eV (ZnO),
respectively [10-14].

Brillouin-zone (BZ) integrations are carried out as summations over special
points of the Monkhorst-Pack (MP) type [21]. Monkhorst-Pack meshes of 5 × 5 × 5
(cubic) or 8 × 8 × 8 (rhombohedral) k-points are found to be sufficient for In2O3
[10]. For hexagonal ZnO a 12 × 12 × 7 mesh is applied [12]. In the rt-SnO2 case,
we use a mesh of 8 × 8 × 14 k-points [13].
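The mesh sizes quoted above can be made concrete with a minimal sketch of the Monkhorst-Pack construction. It generates the full, unreduced set of fractional k-point coordinates; the function name `monkhorst_pack` is a hypothetical helper, and the sketch makes no claim about how VASP places or symmetry-reduces the mesh internally.

```python
from itertools import product


def monkhorst_pack(n1, n2, n3):
    """Fractional coordinates of an n1 x n2 x n3 Monkhorst-Pack mesh.

    Along each reciprocal axis the points are u_r = (2r - q - 1) / (2q)
    with r = 1..q. No symmetry reduction is applied.
    """
    def axis(q):
        return [(2 * r - q - 1) / (2 * q) for r in range(1, q + 1)]

    return list(product(axis(n1), axis(n2), axis(n3)))


# Mesh sizes quoted in the text for the bulk calculations.
meshes = {
    "bcc-In2O3": (5, 5, 5),
    "rh-In2O3": (8, 8, 8),
    "wz-ZnO": (12, 12, 7),
    "rt-SnO2": (8, 8, 14),
}
for name, dims in meshes.items():
    kpts = monkhorst_pack(*dims)
    has_gamma = (0.0, 0.0, 0.0) in kpts
    print(f"{name}: {len(kpts)} k-points, contains Gamma: {has_gamma}")
```

In this construction only meshes with odd divisions along all three axes contain the zone center; in practice the number of points actually evaluated is smaller than the full count because symmetry-equivalent points are merged.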

All calculations were carried out on the NEC SX-9 system, on which our code
shows both its best scaling behavior and its best performance per CPU (see
Ref. [22] for details on performance and scalability).

By minimizing the total energy obtained within DFT-LDA or DFT-GGA we
obtain the cubic ($a_0$) and non-cubic ($a$, $c$) lattice constants for bulk TCOs. For
In2O3 we find values $a_0$ = 10.09 Å [10] (experiment: 10.12 Å [23]) and 5.48 Å
[10] (experiment: 5.49 Å [23]) for the bcc and rh structure, respectively. In the case
of rt-SnO2 we get a = 4.74 Å and c = 3.20 Å [13] (experiment: a = 4.74 Å and
c = 3.19 Å [24]), and, finally, for wz-ZnO we observe a = 3.28 Å and c = 5.28 Å
(experiment: a = 3.25 Å and c = 5.2 Å [24]). Except for c for rt-SnO2 the lattice
constants differ from the corresponding measured values by less than 1 %.

The structural parameters are then used for the calculation of the QP band
structures [14, 25]. The QP equation for the electronic self-energy is solved
perturbatively on top of the self-consistent solution of a generalized Kohn-Sham
(gKS) equation using Hedin’s GW method. In the zeroth approximation we describe
the GW self-energy by the spatially non-local XC potential $V_{\mathrm{XC}}(\mathbf{x}, \mathbf{x}')$ described by
the hybrid functional HSE06 of Heyd et al. [26-29]. The QP shifts for the gKS
eigenvalues are computed within the $G_0W_0$ approach [30]. It has been demonstrated
that for the studied compounds this treatment leads to energy gaps in excellent
agreement with measured values [10, 12-14, 31].

The surface calculations are carried out using the repeated slab supercell method.
The number of layers in the slab is 9 (bcc-In2O3(001)), 8 (bcc-In2O3(110)), 11
(rh-In2O3(001)), 8 (rt-SnO2(001)), 19 (rt-SnO2(100)), 6 (wz-ZnO(10$\bar{1}$0)), and 20
(wz-ZnO(0001)), with 12 Å of vacuum each. Usually orthorhombic slabs are applied,
resulting typically in N × N × 1 MP meshes, with N = 3, 8, 8, and 12 for bcc-In2O3,
rh-In2O3, SnO2(001), and ZnO, respectively. The SnO2(100) slab is treated using
an MP mesh of 8 × 14 × 1. For polar directions, i.e. bcc-In2O3(001), SnO2(001), and
ZnO(0001), we encounter the fundamental problem of a net dipole moment within
the slab, and the additional difficulty that a slab with two non-equivalent surfaces
does not allow the computation of surface energies. To get around these obstacles
we employ symmetric slabs by breaking the stoichiometry within the supercell
and adding an additional layer of oxygen or, alternatively, metal atoms. Because
of different bond lengths in the [0001] and the [000$\bar{1}$] direction, respectively, the
ZnO(0001) slab cannot be symmetrized this way. We introduced a central Zn layer
and constrained the bond lengths to create a symmetric slab, thereby creating
an unphysical strain in the center of the slab. Since the lateral cell size of the
(0001)-1 × 1 slab is small, we make the slab thick enough to obtain a converged
electrostatic potential exhibiting bulklike oscillations and a plateau in the vacuum
region. However, due to the additional unknown strain we cannot calculate surface
energies for this surface orientation.

To align the QP band levels in the slabs one needs to determine the electrostatic
potential U(x) acting on the electrons. It can be derived from the effective
single-particle potential occurring in the Kohn-Sham equation [16] or the generalized
Kohn-Sham equation [14] and is defined as the local part of the electron-ion
interaction described by the pseudopotentials and the Hartree potential of the
electrons. This quantity is independent of the local (LDA), semi-local (GGA) or non-local
(HSE06) description of the exchange-correlation part of the effective single-particle
potential and is therefore well suited to serve as a universal reference level.

Fig. 1 Planar average of the electrostatic potential V(z) along the cubic axis z || [10$\bar{1}$0] for the
ZnO(10$\bar{1}$0) slab (black) and for the ZnO bulk calculations (red). The QP conduction band minimum
$E_c$ and valence band maximum $E_v$ are shown as dotted lines. The ionization energy I and electron
affinity A are indicated. The vacuum level is used as energy zero

The ionization energy

$$ I = E_{\mathrm{vac}} - E_{v} \qquad (1) $$

and the electron affinity

$$ A = E_{\mathrm{vac}} - E_{c} \qquad (2) $$


are defined as the energy difference between the vacuum level $E_{\mathrm{vac}}$, i.e. the
electrostatic potential as seen by an electron in the vacuum, and the valence band
maximum (VBM) $E_v$ and the conduction band minimum (CBM) $E_c$, respectively.
Hence, to obtain QP values for I and A, one has to align the QP bulk band
structure with the electrostatic potential as obtained through the DFT-LDA slab
calculations. As an example, the alignment for the ZnO(10$\bar{1}$0) surface is shown
in Fig. 1. The plane-averaged electrostatic potential V(z) shows the characteristic
atomic oscillations in the area of the slab and reaches a plateau in the vacuum
region. By aligning the atomic oscillations in the slab with those obtained via the
bulk calculation one derives I and A.
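The alignment step can be sketched in a few lines. The sketch assumes the bulk band edges and a bulk reference value of the electrostatic potential are given on a common bulk energy scale, and that the slab profile supplies a bulk-like interior value and a vacuum plateau. All function names and numbers are illustrative assumptions and are not taken from the calculations reported here.

```python
def plateau_mean(profile, start, stop):
    """Mean of a planar-averaged potential over grid indices [start, stop)."""
    window = profile[start:stop]
    return sum(window) / len(window)


def ionization_and_affinity(v_vac, v_slab_ref, v_bulk_ref, e_v_bulk, e_c_bulk):
    """Return (I, A) from Eqs. (1) and (2) after potential alignment.

    The bulk band edges are shifted by the offset between the bulk-like
    potential inside the slab and the same reference in the bulk cell.
    """
    shift = v_slab_ref - v_bulk_ref
    e_v = e_v_bulk + shift
    e_c = e_c_bulk + shift
    return v_vac - e_v, v_vac - e_c


# Hypothetical planar-averaged slab potential in eV: oscillating interior
# followed by a flat vacuum region.
slab_profile = [-10.2, -9.8, -10.2, -9.8, 3.9, 4.0, 4.0, 4.0]
v_slab_ref = plateau_mean(slab_profile, 0, 4)  # bulk-like interior average
v_vac = plateau_mean(slab_profile, 5, 8)       # vacuum plateau

# Hypothetical bulk reference potential and QP band edges in eV.
ionization, affinity = ionization_and_affinity(v_vac, v_slab_ref, -12.0, -4.5, -1.5)
print(f"I = {ionization:.2f} eV, A = {affinity:.2f} eV")
```

By construction the difference I - A equals the bulk QP gap, so the gap is untouched by the alignment and only the absolute position of the band edges relative to the vacuum level depends on the surface.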

Computing I and A using this method we come up against a theoretical problem
in the QP description. The GW approximation sets the vertex function $\Gamma$ for the
calculation of the XC self-energy $\Sigma = GW\Gamma$ to $\Gamma = 1$. It has been shown that
the inclusion of vertex corrections changes the position of the SiO2 VBM by 0.6 eV
while the gap is closed by 0.3 eV, so that I and A are reduced by 0.6 eV and 0.3 eV,
respectively [32]. Therefore, a variation of the band edges of about 5-10 % due to
further many-body effects cannot be excluded.

The surface energy $E_{\mathrm{sf}}$ is defined as the energy cost for creating the surface. For
stoichiometric cells with symmetric slabs with surface area A containing N bulk
unit cells it can be easily calculated by

$$ E_{\mathrm{sf}} = \frac{E_{\mathrm{slab}} - N E_{\mathrm{bulk}}}{2A} \qquad (3) $$

where $E_{\mathrm{slab}}$ and $E_{\mathrm{bulk}}$ are the calculated total energies for the slab and the bulk unit
cell, respectively. For non-stoichiometric slabs we have to generalize the formalism
to include the chemical potentials $\mu_{\mathrm{cat}}$ and $\mu_{\mathrm{O}}$ of the cation (i.e. In, Sn, or Zn) and
oxygen atom, respectively [33]. For a slab with $N_{\mathrm{cat}}$ cations and $N_{\mathrm{O}}$ oxygen atoms
Eq. (3) then turns into

$$ E_{\mathrm{sf}} = \frac{1}{2A}\left[ E_{\mathrm{slab}}(N_{\mathrm{cat}}, N_{\mathrm{O}}) - N_{\mathrm{cat}}\mu_{\mathrm{cat}} - N_{\mathrm{O}}\mu_{\mathrm{O}} \right]. \qquad (4) $$

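Since the bulk acts as reservoir, the chemical potentials of the atoms are related by

$$ n_{\mathrm{cat}}\mu_{\mathrm{cat}} + n_{\mathrm{O}}\mu_{\mathrm{O}} = \mu_{\mathrm{bulk}}, \qquad (5) $$

where $n_{\mathrm{cat}}$ and $n_{\mathrm{O}}$ denote the number of cations and oxygen atoms per formula unit,
respectively, $\mu_{\mathrm{bulk}} = E_{\mathrm{bulk}}/n_{\mathrm{unit}}$ is the chemical potential per formula unit, and $n_{\mathrm{unit}}$
the number of formula units per bulk unit cell. Substituting Eq. (5) into Eq. (4), we
obtain

$$ E_{\mathrm{sf}} = \frac{1}{2A}\left[ E_{\mathrm{slab}} - \frac{N_{\mathrm{cat}}}{n_{\mathrm{cat}}}\,\mu_{\mathrm{bulk}} - \left( N_{\mathrm{O}} - \frac{n_{\mathrm{O}}}{n_{\mathrm{cat}}}\,N_{\mathrm{cat}} \right)\mu_{\mathrm{O}} \right]. \qquad (6) $$

This enables us to determine surface stabilities in dependence on the chemical
potential of the oxygen atom, generally given in relation to $\mu_{\mathrm{O_2}}$, the chemical
potential of the free O2 molecule as obtained via an LDA total energy calculation
of a free oxygen molecule.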
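A minimal numerical sketch of Eq. (6) shows how the surface energy of a non-stoichiometric slab becomes a linear function of the oxygen chemical potential. The function name and all energies below are hypothetical placeholders chosen for easy arithmetic, not results of the present calculations.

```python
def surface_energy(e_slab, n_cat_slab, n_o_slab, mu_bulk, n_cat, n_o, mu_o, area):
    """Surface energy per unit area following Eq. (6).

    e_slab      total energy of the symmetric slab
    n_cat_slab  number of cations in the slab (N_cat)
    n_o_slab    number of oxygen atoms in the slab (N_O)
    mu_bulk     bulk chemical potential per formula unit
    n_cat, n_o  cations and oxygen atoms per formula unit
    mu_o        chemical potential of the oxygen atom
    area        surface area A of one side of the slab
    """
    excess_o = n_o_slab - n_o * n_cat_slab / n_cat
    return (e_slab - n_cat_slab / n_cat * mu_bulk - excess_o * mu_o) / (2.0 * area)


# In2O3 has n_cat = 2 and n_O = 3 per formula unit.
# Hypothetical slabs: a stoichiometric one (8 In, 12 O) and an
# O-terminated one carrying two extra oxygen atoms (8 In, 14 O).
for mu_o in (-3.0, -2.0, -1.0):
    stoich = surface_energy(-100.0, 8, 12, -26.0, 2, 3, mu_o, 10.0)
    o_rich = surface_energy(-100.0, 8, 14, -26.0, 2, 3, mu_o, 10.0)
    print(f"mu_O = {mu_o:5.1f}: stoichiometric {stoich:.3f}, O-terminated {o_rich:.3f}")
```

For a stoichiometric slab the oxygen excess vanishes and the expression reduces to Eq. (3), independent of the oxygen chemical potential. For the oxygen-rich slab the slope with respect to the oxygen chemical potential is minus the excess divided by 2A, which is the origin of the straight lines in plots of surface energy against chemical potential.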


3 Results and Discussion 

We determine the ionization energies and electron affinities of the TCOs for 
different surface orientations and terminations. The results are listed in Table 1 and 
displayed in Fig. 2. 

There are only a few measurements of the surface properties of In2O3 and
Sn-doped In2O3 (indium tin oxide, ITO). The electron affinity seems to vary
in the range of A = 4.1...5.0 eV in dependence on the doping concentration
(see Ref. [34] and references therein). Together with the measured gap of 3.6 eV,
ionization energies of I = 7.7...8.6 eV may be derived. Klein [35] suggested values
of A = 3.5 ± 0.2 eV and I = 7.1 ± 0.15 eV for evaporated In2O3 films. In a more
recent paper [36] the same author gave values of A = 4.45 eV and I = 8.05 eV
for ITO samples. Our theoretical values indicate a strong influence of the surface
orientation and termination. The oxygen-terminated (001) surface differs from the
indium-terminated surface by more than 3 eV. To illustrate, the planar averages of
the respective electrostatic potentials are shown in Fig. 3a. The atomic configurations
of the different surface terminations are shown in Fig. 3b, along with isosurfaces of
the electrostatic potential, to illustrate the different surface barriers. This effect is
due to the strong influence of the direction of the surface dipole in highly polar
materials like the TCOs. Dangling bonds located at the oxygen atoms will most

Table 1 Characteristic energies: fundamental gap $E_g$, electron affinity A, and ionization energy
I of transparent conducting oxides derived from QP calculations. All values in eV. The surface
orientation used for the calculation of I and A is indicated by the Miller indices (hkl) or (hkil).
Experimental values are given in parentheses

Crystal     Orientation            E_g        A                      I
rh-In2O3    (0001)                 3.79       6.11                   9.41
                                   (3.02)^a
bcc-In2O3   (001) In-terminated    3.15       6.10                   9.25
            (001) O-terminated                9.22                   12.37
            (110)                             5.30                   8.45
                                   (3.58)^a   (3.5-5.0)^b            (7.1-8.6)^b
rt-SnO2     (100)                  3.64       4.10                   7.73
            (001) Sn-terminated               3.45                   7.08
                                   (3.6)^c    (4.44)^d               (8.04)^d
wz-ZnO      (0001) Zn-terminated   3.21       5.07                   8.24
            (10-10)                           3.65                   6.84
                                   (3.38)^e   (3.7-4.6, 4.05,        (7.1-8.0, 7.45,
                                              4.42, 4.64)^d,f,g,h    7.82, 8.04)^d,f,g,h

^a References [37,38]
^b References [35,36]
^c Reference [46]
^d References [41,42]
^e Reference [45]
^f Reference [43]
^g Reference [39]
^h Reference [40]



Fig. 2 Valence band (red) and conduction band (green) edges for (1) rh-In2O3(0001), (2) bcc-In2O3(001)
O-terminated, (3) bcc-In2O3(001) In-terminated, (4) bcc-In2O3(110), (5) SnO2(100),
(6) SnO2(001) Sn-terminated, (7) ZnO(0001) Zn-terminated, and (8) ZnO(10$\bar{1}$0). The vacuum
level is used as energy zero

likely increase polarity and, hence, the surface dipole of the slab. On the other
hand, the In-termination will decrease the dipole, lowering the surface barrier
for electrons. All in all, our predictions seem to overestimate the experimental

Fig. 3 (a) Planar average of the electrostatic potential V(z) for the In-terminated (black) and
O-terminated (red) bcc-In2O3(001) surface. (b) Surface structure of the O-terminated (above) and
In-terminated (below) bcc-In2O3(001) surface. Isosurfaces of the spatially resolved electrostatic
potential are shown for V(x) = −3 eV


findings. The discrepancies to the largest experimental values are of the order of 
0.5 eV. Apart from uncertainties in the theoretical description, such as the neglect of 
vertex corrections in the GW approximation, several problems of the real-structure 
surfaces such as doping influence, surface coverage (and, hence, surface dipole), 
and sample quality may occur. Also the gap value of 3.6 eV taken from optical 
measurements deviates by 0.5 eV from the recently predicted one [37], mostly due 
to the fact that the lowest interband transitions are dipole-forbidden in the bixbyite 
structure [38]. 

There are several measurements regarding the surface barrier of wz-ZnO, which
vary over a wide range. Jacobi et al. [39] found electron affinities of A = 3.7, 4.5,
and 4.6 eV depending on surface orientation and termination. Other electric
measurements yield an electron affinity of A = 4.64 eV [40]. A value of A = 4.05 eV
is derived from studies of the semiconductor-electrolyte interface [41], which yields
I = 7.45 eV taking into account the known gap [42]. Another measurement gave
I = 7.82 eV [43]. Again, it seems that we overestimate the measured values.

The surface properties of SnO2 are hardly known. Measurements gave
A = 4.44 eV [41] which, in combination with the gap of 3.6 eV measured for
rt-SnO2 [42], yields an ionization energy of I = 8.04 eV. For tetragonal SnO2, a
variation in the interval I = 7.9-8.9 eV, depending on Sb doping, is reported [44].
SnO2 is therefore the only TCO where our predictions seem to underestimate the
experimental value. This might be due to a possible influence of virtual electronic
states in the fundamental gap [8].

The large variety of measured values for I and A of the TCOs is probably not
only due to sample quality problems, but also to the strong dependence of the
surface barrier on surface orientation and termination. To investigate the influence
of the surface preparation on the surface energy of the different orientations and
terminations we calculate the surface energy of important surfaces for the bcc
geometry of In2O3.

Fig. 4 Surface energy $E_{\mathrm{sf}}$ per unit area of the non-polar (110) (blue), the In-terminated (001)
(black), and the O-terminated (001) (red) surface of bcc-In2O3 in dependence on the chemical
potential $\mu_{\mathrm{O}}$ of the oxygen atom. The energy zero is set to half the chemical potential of a free O2
molecule


The surface energy is plotted in dependence on the chemical potential of the
oxygen atom in Fig. 4. For the polar (001) direction we see a strong dependence of
the surface energy on the chemical potential $\mu_{\mathrm{O}}$ of the O atom. While in the
oxygen-rich limit of $\mu_{\mathrm{O}} = \tfrac{1}{2}\mu_{\mathrm{O_2}}$ the O-termination is favored over the In-termination by a
factor of 2, this changes drastically in oxygen-poor environments. The In-terminated
surface becomes equally stable or even energetically favored over the oxidized
surface. The non-polar (110) surface with a surface energy $E_{\mathrm{sf}}$ = 0.11 eV/Å² is
energetically favored over both (001) phases in practically all settings. We therefore
conclude that the ionization energy and electron affinity of bcc-In2O3 strongly
depend on the preparation conditions of the sample.


4 Summary and Conclusions 

We have investigated the electronic properties of the transparent conducting oxides
In2O3, SnO2 and ZnO by means of quasiparticle calculations based on many-body
perturbation theory. The resulting band structures with rather accurate fundamental
band gaps were combined with density functional theory calculations of material
slabs to investigate electronic surface properties for different surface orientations
and terminations. For this purpose the bulk and surface electronic structures have
been energetically aligned using the electrostatic potential as reference.

The results were compared with the few experimental data available. Even
though the experimental values are at variance with one another, we nevertheless
found a slight but systematic overestimation of ionization energy and electron affinity
in our predictions. This is most likely due to the omission of vertex corrections in the
quasiparticle calculations. The only exception to this rule was SnO2, possibly due to
the influence of states within the fundamental gap at real surfaces. We found a strong
dependence of the surface barrier on the orientation and termination of the surface.
This is caused by the strong surface dipoles in highly ionic compounds.

We also analyzed the influence of the chemical potential of oxygen on the surface
energy of different surface orientations and terminations of bcc-In2O3. We found
that the surface stability of different phases and, hence, the surface barrier of the
sample, is strongly dependent on the preparation conditions.


Surface Magnetism: Relativistic Effects
at Semiconductor Interfaces and Solar Cells


U. Gerstmann, M. Rohrmüller, N.J. Vollmers, A. Konopka, S. Greulich-Weber,
E. Rauls, M. Landmann, S. Sanna, A. Riefer, and W.G. Schmidt


Abstract Ab initio calculations of the electronic g-tensor of paramagnetic states
at surfaces and solar cells are presented, with special emphasis placed on
the influence of relativistic effects. After discussing the numerical requirements for
such calculations, we show that for silicon surfaces the g-tensor varies critically
with the hydrogen coverage and provides an exceptionally characteristic property.
This also holds in the case of powder spectra, where only the isotropic part $g_{\mathrm{av}}$
is available from experiments. Extending our calculations to microcrystalline
3C-SiC, our study explains why sol-gel grown undoped material can serve as an
excellent acceptor material for an effective charge separation in organic solar cells:
Due to an auto-doping mechanism by surface-induced states it fits excellently into
the energy level scheme of this kind of solar cell and has the potential to replace the
commonly used, rather expensive fullerenes.


1 Introduction 

Solar cells provide an increasing market with high potential for further
development. The global market for photovoltaic cells is expected to double
during the next decade. However, overcapacities will be an ongoing challenge
for the manufacturers, and a weeding-out and consolidation process seems to be
unavoidable. One way out is the production of highly efficient solar cells at minimal
cost. So far, however, such an optimization of the cells is mainly based on trial
and error. For a further improvement of cell performance a better understanding
of the microscopic structures and the basic electronics behind the light-induced
separation of charge carriers as well as the efficiency-limiting processes is crucial.

Experimentally, electron paramagnetic resonance (EPR) provides a powerful tool
to analyse the microscopic structure of paramagnetic systems. Since most of the
electrically active centers in solar cells are those with unpaired electrons, these
electronic states are paramagnetic. Hence, EPR provides an appropriate possibility
to characterize the basic material as well as the solar cells themselves. In many cases,
however, the wealth of important information available from EPR measurements
cannot be extracted from experimental data alone. For an identification of the
microscopic structures, accurate first-principles calculations of as many relevant
properties as possible are necessary to calculate a fingerprint of the structure that can
be compared with the experimental values. From EPR experiments the components
of the electronic g-tensor are available also in those cases in which hyperfine (hf)
splittings cannot be resolved. However, in contrast to the ab initio calculation of
hf splittings, which already has an appreciable history, quantitative predictions of
electronic g-tensors making use of the machinery of ab initio density functional
theory (DFT) have become possible only very recently [1]. In semiconductors,
this has already been demonstrated successfully for defects in SiC and GaN bulk
material [2-4]. For surfaces, however, theoretical data obtained by first-principles
calculations are very rare. In this work we show that the EPR parameters are mainly
influenced by relativistic effects like the spin-orbit coupling (SOC). At surfaces
and interfaces these effects become exceptionally anisotropic, giving rise to special
effects like the so-called Rashba effect [5].

We evaluate our method using hydrogenated silicon surfaces as an example.
Such surface states appear in hydrogenated microcrystalline silicon (μc-Si:H), a
material that can be used for efficient and low-cost solar cells [6] (see Fig. 1).¹
We show that the ab initio calculation of g-tensors can help to elucidate the
situation in such microcrystalline material. We calculate the elements of the electronic
g-tensor for some paramagnetic states at silicon surfaces from first principles, using a
recently developed gauge-including projector augmented plane wave (GI-PAW)
approach [1,9] in the framework of DFT. According to the in-diffusion of water and
atmospheric gases [10] we investigate the EPR fingerprint of those paramagnetic
states that are created by hydrogen adsorbed at Si(111) and Si(001) surfaces. Our
approach is shown to be able to distinguish between different surface states [11].
For silicon surfaces with different hydrogen coverages, the g-tensor is by far more
characteristic than the hf splitting of the Si dangling bonds or that of the adsorbed
H atoms. This also holds for powder spectra, as in the case of amorphous or
microcrystalline material for solar cells, where only the angular average of the spectra
is available experimentally.
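The angular average accessible in powder spectra is the isotropic part of the g-tensor, i.e. one third of its trace. A short sketch, with purely illustrative principal values for an axial dangling-bond-like center (they are placeholders, not results of this work) and an approximate free-electron g value entered as an assumption:

```python
G_FREE = 2.0023193  # approximate free-electron g value (assumed constant)


def isotropic_g(principal_values):
    """Isotropic part g_av = (g_1 + g_2 + g_3) / 3 of a g-tensor."""
    g1, g2, g3 = principal_values
    return (g1 + g2 + g3) / 3.0


def shift_ppm(g):
    """Deviation of g from the free-electron value in parts per million."""
    return (g - G_FREE) * 1.0e6


# Hypothetical axial tensor: one parallel and two perpendicular components.
g_parallel, g_perp = 2.0014, 2.0081
g_av = isotropic_g((g_parallel, g_perp, g_perp))
print(f"g_av = {g_av:.5f}, shift = {shift_ppm(g_av):.0f} ppm")
```

The average discards the anisotropy, so two centers are only distinguishable in a powder spectrum if their traces differ by more than the experimental resolution.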

A central challenge on the way to optimized solar cells is to make the thickness 
of the individual layers smaller than the diffusion length of the charge carriers. 
Recently, 3C-SiC microcrystals grown by a sol-gel based process have been 


¹ In comparison with cells based on amorphous silicon, these cells suffer less from the notorious
light-induced degradation, known as the Staebler-Wronski effect [7]. Best cell performance is,
however, achieved for material grown close to the transition to amorphous growth [8].

Fig. 1 From microcrystalline material (here: atomic structure of oxidized μc-Si) towards single
solar cells and photovoltaic plants


proposed as a promising acceptor material for photovoltaic applications [27]. Such
μc-SiC samples have already been characterized by optical spectroscopy and
electron paramagnetic resonance (EPR) [32]. In this work, the available experimental
data is analyzed with the help of ab initio DFT calculations resulting in electronic
band structures and g-tensors. Based on this, a possible scenario for the observed
acceptor process is discussed.


2 Methodology 

Our first-principles calculations of the EPR parameters are based on density
functional theory (DFT) using the generalized gradient approximation for the
electron exchange and correlation functional (GGA-PBE) in its spin-polarized
form [12]. The hyperfine splittings, i.e. the interaction of the magnetic moments of
the electrons with those of the nuclei, are calculated taking into account relativistic
effects in scalar-relativistic approximation [13, 14]. Although there exists a
non-relativistic derivation for the isotropic contact interaction by Fermi [15], Breit has
shown that the origin of the hyperfine splitting can only be described correctly in a
relativistic treatment [16]. The static magnetic field caused by the magnetic moment
$\boldsymbol{\mu}_I = g_N \mu_N \mathbf{I}$ of a nucleus with gyromagnetic ratio $g_N$ located at the origin is
included using the vector potential for this magnetic field

$$ \mathbf{A}_I(\mathbf{r}) = \nabla \times \left( \frac{\boldsymbol{\mu}_I}{r} \right), \qquad (1) $$

replacing the momentum operator $\mathbf{p}$ by the canonical momentum $\boldsymbol{\pi} = \mathbf{p} + \frac{e}{c}\mathbf{A}_I$
in Dirac’s equation

$$ \left( c\,\boldsymbol{\alpha}\cdot\mathbf{p} + \beta m c^2 + V - E_{\mathrm{rel}} \right)\Phi = 0. \qquad (2) $$

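The influence of the resulting magnetic fields $\mathbf{B}_I(\mathbf{r}) = \nabla \times \mathbf{A}_I(\mathbf{r})$ leads to level
splittings in the $10^{-12} \ldots 10^{-2}$ eV range. The smallness of these splittings allows
a simplified computation via perturbation theory. Within first-order perturbation
theory the expectation value of the hyperfine interaction is given by

$$ E_{\mathrm{hf}} = -e\,\langle \Phi |\, \boldsymbol{\alpha}\cdot\mathbf{A}_I(\mathbf{r}) \,| \Phi \rangle. \qquad (3) $$

Here, $\boldsymbol{\alpha}$ is a 4 × 4 matrix constructed from the 2 × 2 Pauli spin matrices $\sigma_x$, $\sigma_y$, and
$\sigma_z$, respectively, whereby $|\Phi\rangle = (\Phi_L, \Phi_S)^{T}$ is given by the Dirac spinor decomposing
into the two-component Pauli spinors $\Phi_L$ and $\Phi_S$. For light atoms, $\Phi_L$ is the large
component whereas $\Phi_S$ turns out to be small. This leads to

$$ E_{\mathrm{hf}} = -e \left( \langle \Phi_L |\, \boldsymbol{\sigma}\cdot\mathbf{A}_I(\mathbf{r}) \,| \Phi_S \rangle + \langle \Phi_S |\, \boldsymbol{\sigma}\cdot\mathbf{A}_I(\mathbf{r}) \,| \Phi_L \rangle \right). \qquad (4) $$

Thus, $E_{\mathrm{hf}}$ is a genuine relativistic term that couples large and small components of
Dirac’s equation. The small component $\Phi_S$ can be expressed in terms of the large
component $\Phi_L$ as

$$ \Phi_S = \frac{c\,\boldsymbol{\sigma}\cdot\mathbf{p}}{2mc^2 + E - V(r)}\,\Phi_L = \frac{S(r)}{2mc}\,(\boldsymbol{\sigma}\cdot\mathbf{p})\,\Phi_L, \qquad (5) $$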


whereby $S(r)$ is the inverse relativistic mass correction. This can be used to express
$\Phi_S$ in terms of $\Phi_L$, leading to expectation values containing the large component
exclusively [17]. In the case of orbital quenching we obtain [14]:

$$ E_{\mathrm{contact}} = \frac{8\pi}{3}\,\mu_B \,\langle \Phi_L |\, S(r)\,\boldsymbol{\mu}_I\cdot\boldsymbol{\sigma}\,\delta(\mathbf{r}) \,| \Phi_L \rangle + \mu_B \,\Big\langle \Phi_L \Big|\, \frac{1}{r^4}\frac{\partial S}{\partial r} \left[ \boldsymbol{\mu}_I\cdot\boldsymbol{\sigma}\, r^2 - (\boldsymbol{\mu}_I\cdot\mathbf{r})(\boldsymbol{\sigma}\cdot\mathbf{r}) \right] \Big| \Phi_L \Big\rangle \qquad (6) $$

$$ E_{\mathrm{dip}} = -\mu_B \,\Big\langle \Phi_L \Big|\, \frac{S(r)}{r^5} \left[ \boldsymbol{\sigma}\cdot\boldsymbol{\mu}_I\, r^2 - 3\,(\boldsymbol{\sigma}\cdot\mathbf{r})(\boldsymbol{\mu}_I\cdot\mathbf{r}) \right] \Big| \Phi_L \Big\rangle \qquad (7) $$

The dipolar term $E_{\mathrm{dip}}$ is angular dependent and, thus, in the general case gives rise
to anisotropic hf tensors. In the non-relativistic case, since $S(r) \to 1$, only the first
term in (6) contributes to the isotropic part, the so-called contact term. By this,
we obtain the results of the classical theory given by Fermi [15], that only the
probability amplitude at the nucleus contributes. In the relativistic case, however,
this first term does not contribute at all. It is the second term in (6) which becomes the
relativistic analogue to the contact interaction. For a pure Coulombic potential

$$ V(r) = -\frac{Ze^2}{r} \qquad (8) $$

the derivative $\partial S(r)/\partial r$ is similar to a broadened $\delta$-function

$$ \delta_{\mathrm{Th}}(r) = \frac{1}{4\pi r^2}\frac{\partial S}{\partial r} = \frac{1}{4\pi r^2}\,\frac{r_{\mathrm{Th}}/2}{\left[ \left( 1 + \frac{E}{2mc^2} \right) r + r_{\mathrm{Th}}/2 \right]^2}. \qquad (9) $$

In other words, the magnetization density of the electron in the relativistic theory
is not evaluated at the origin, where it would be divergent for s electrons, but is
averaged over a sphere of radius

$$ r_{\mathrm{Th}} = \frac{Ze^2}{mc^2}, \qquad (10) $$

which is the Thomson radius, about ten times the nuclear radius.

As a result, the divergence of the s electrons presents no problem. Also, if we
approximate the nuclear potential by that of a charged volume rather than that of
a point charge [18], the divergence already disappears. However, it is important to
note that we would obtain divergent contact terms when mixing the approximations, e.g.
using (scalar)² relativistic orbitals in a non-relativistic formula.
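The size of the Thomson radius and the shape of the broadened δ-function can be checked numerically. The sketch below assumes a point-charge Coulomb potential, uses the classical electron radius e²/(mc²) ≈ 2.818 fm, and estimates the nuclear radius with the empirical rule r ≈ 1.2 fm · A^(1/3); the latter rule and all function names are assumptions introduced for illustration only.

```python
import math

R_E_FM = 2.8179403  # classical electron radius e^2 / (m c^2) in fm


def thomson_radius_fm(z):
    """Thomson radius r_Th = Z e^2 / (m c^2) in fm."""
    return z * R_E_FM


def nuclear_radius_fm(mass_number, r0=1.2):
    """Empirical nuclear radius r0 * A^(1/3) in fm (assumed estimate)."""
    return r0 * mass_number ** (1.0 / 3.0)


def mass_correction(r, r_th, eps=0.0):
    """S(r) for V = -Z e^2 / r, with eps = E / (2 m c^2)."""
    return r / ((1.0 + eps) * r + 0.5 * r_th)


def delta_th(r, r_th, eps=0.0):
    """Broadened delta function (1 / 4 pi r^2) dS/dr for the Coulomb case."""
    return 0.5 * r_th / (4.0 * math.pi * r ** 2 * ((1.0 + eps) * r + 0.5 * r_th) ** 2)


# Silicon-28 as an example: Z = 14, A = 28.
r_th = thomson_radius_fm(14)
r_nuc = nuclear_radius_fm(28)
print(f"r_Th = {r_th:.2f} fm, r_nuc = {r_nuc:.2f} fm, ratio = {r_th / r_nuc:.1f}")

# Fraction of the weight of delta_Th enclosed within a radius R:
# the integral of 4 pi r^2 delta_Th from 0 to R equals S(R) - S(0) = S(R).
for factor in (1, 10, 100):
    print(f"R = {factor:3d} r_Th: enclosed weight {mass_correction(factor * r_th, r_th):.3f}")
```

For eps = 0 the enclosed weight tends to one for large R, so the function is normalized like a δ-function but spread over a region of the order of the Thomson radius.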

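Whereas the hf splittings depend on the magnetization density $m(\mathbf{r}) = n^{\uparrow}(\mathbf{r}) - n^{\downarrow}(\mathbf{r})$
exclusively, the main deviation of the g-tensor from its free electron value
$g_e \approx 2.002\,319\,278$ is given by the spin-orbit coupling of the many-particle
system. In physically transparent form it can be written in terms of spin-polarised
electron currents $\mathbf{j}^{(1),\mu}_{\uparrow}$ and $\mathbf{j}^{(1),\mu}_{\downarrow}$ induced by a unit magnetic field $\mathbf{B}^{\mu}$ applied along
the direction $\mu$:

$$ \Delta g^{\mathrm{SO}}_{\mu\nu} = \frac{\alpha}{2S} \left[ \int \left( \nabla V^{\uparrow}_{\mathrm{eff}} \times \mathbf{j}^{(1),\mu}_{\uparrow}(\mathbf{r}) \right)_{\nu} d^3r - \int \left( \nabla V^{\downarrow}_{\mathrm{eff}} \times \mathbf{j}^{(1),\mu}_{\downarrow}(\mathbf{r}) \right)_{\nu} d^3r \right]. \qquad (11) $$

To obtain this result we start again from Dirac’s equation and apply perturbation
theory with respect to spin-orbit coupling and to an external magnetic field B.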


2 In the scalar relativistic treatment is caculated solving Dirac’s equation but thereby ignoring 

spin-orbit interactions. This leaves the electron spin as a “good” quantum number. Already in a 
scalar relativistic treatment, s-like wave functions diverge at the nuclear site (if the nucleus is 
taken to be a point charge). 
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a is the fine structure constant, S the total spin given by the number of unpaired 
electrons times 1 s. V denotes the gradient of the spin-polarized effective 
potential. Besides ground state quantities the evaluation of the g-tensor requires 
the calculation of the spin-currents in linear magnetic response [9]: 

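$$ \mathbf{j}^{(1),\mu}_{\sigma}(\mathbf{r}) = 2 \sum_{o} \mathrm{Re} \left\langle \psi^{(0)}_{o,\sigma} \right| \mathbf{J}^{p}\, \mathcal{G}_{\sigma}(\varepsilon_{o})\, H^{(1)} \left| \psi^{(0)}_{o,\sigma} \right\rangle + \frac{\alpha}{2}\, n_{\sigma}(\mathbf{r})\, \mathbf{B}^{\mu} \times \mathbf{r} \qquad (12) $$

$\mathbf{J}^{p} = -\tfrac{1}{2} \left( \mathbf{p}\, |\mathbf{r}\rangle \langle \mathbf{r}| + \mathrm{c.c.} \right)$ denotes the current operator for vanishing magnetic field.
$H^{(1)} = \frac{\alpha}{2}\, \mathbf{L} \cdot \mathbf{B}^{\mu}$ describes the influence of the uniform magnetic field, determining
the perturbed wavefunctions $|\psi^{(1)}_{o,\sigma}\rangle = \mathcal{G}_{\sigma}(\varepsilon_{o})\, H^{(1)} |\psi^{(0)}_{o,\sigma}\rangle$ via a Green’s function
of the unperturbed system

$$ \mathcal{G}_{\sigma}(\varepsilon) = \sum_{e} \frac{ |\psi^{(0)}_{e,\sigma}\rangle \langle \psi^{(0)}_{e,\sigma}| }{ \varepsilon - \varepsilon_{e} } \qquad (13) $$

where the sum runs over the empty orbitals e.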

Strictly speaking, the formalism presented so far ensures only a faithful description
of the nuclear and two-electron spin-orbit coupling. According to Refs. [1, 11], higher
order contributions can be approximately taken into account via the spin-other-orbit
(SOO) correction, given by the screening $\mathbf{B}^{(1),\mu}(\mathbf{r})$ of the external magnetic field
$\mathbf{B}^{\mu}$ by the induced currents, as experienced by the magnetization density m(r) of the
unpaired electrons:

Ag S °° = ff / e„ •B <1 ^(r)m(r) d 3 r. (14) 

In the case of the paramagnetic states at Si surfaces the contribution of the SOO
term turns out to be very small (clearly below 10 ppm). In other words, the
spin-other-orbit term does not contribute considerably to the g-values given in
Tables 1-4.

For the modelling of the surfaces we use supercells and periodic boundary
conditions. Hence, the external magnetic field B has to be treated explicitly
in a gauge-invariant way in order to retain the translational invariance of
the wavefunctions. The gauge-including projector augmented plane wave (GI-PAW)
approach satisfies this requirement and allows an ab initio calculation of the
all-electron magnetic response using an efficient pseudopotential approach [1, 9].
The GI-PAW approach is implemented in the pwscf code (QUANTUM-ESPRESSO
package) [19] and has already been applied successfully to identify paramagnetic
defect structures in SiC and GaN [2-4].

To model the semiconductor surfaces (silicon and SiC), at least eight atomic
layers are treated in a supercell, with the lowest layer saturated with
H atoms. To ensure a well-defined transition to bulk material, the atoms in the
lowest three layers were kept at their ideal bulk positions. All other atoms were
allowed to relax freely. To minimize the interaction between the periodic images of
the surface, 10 Å of vacuum is inserted. We use supercells containing up to 175 atoms and
Table 1 Largest hf splittings (MHz) of a H vacancy at the Si(111):H surface as calculated by ab initio
DFT. θ denotes the angle between the principal axis of the hf tensor and the surface normal. All hf
splittings due to H atoms are below 2 MHz

Nucleus   # nuclei   A1       A2       A3       θ
Si1       1          -220.0   -220.0   -414.9   0.0°
Si2       3             1.4      0.0     -6.7   52.1°
Si3       3           -26.5    -27.4    -42.8   O 4^ o
Si4       3           -20.1    -20.4    -26.7   66.3°
Si5       3            -7.5     -7.7    -10.4   10.0°

Table 2 k-point convergence of the g-values calculated for the H vacancy at the Si(111):H surface.
Given are the principal values g_i as well as the angular averaged value g_av. θ denotes the angle
between the principal axis g3 and the surface normal

k-point mesh   g_av       g1        g2        g3        θ
Γ              2.010887   2.01250   2.01250   2.00766   0.0°
2 × 2 × 1      2.006393   2.00925   2.00925   2.00068   0.0°
3 × 3 × 1      2.006607   2.00939   2.00939   2.00104   0.0°
4 × 4 × 1      2.006630   2.00939   2.00939   2.00111   0.0°
5 × 5 × 1      2.006630   2.00939   2.00939   2.00111   0.0°
6 × 6 × 1      2.006630   2.00939   2.00939   2.00111   0.0°

Table 3 Comparison of the calculated g-tensors for a single adsorbed H-atom and for a nearly
complete H-coverage of the Si(001) surface. For a better comparison with the calculated hf
splittings (MHz) the calculated Δg-values are here given in ppm

Surface                              A1     A2       A3     θ_hf   Δg_av    Δg1      Δg2     Δg3      θ
(111):H saturated with H vacancy     -220   -220     -415   0°     4,309    7,070    7,070   -1,213   0.0°
(001): single adsorbed H atom        -189   -189     -373   18°    -2,149   -2,079   -219    -4,139   27.8°
(001): monohydride with H vacancy    -254   -2,554   -450   20°    2,941    5,841    3,431   -449     33.5°

Table 4 Analysis of 3C-SiC microcrystals: Calculated g-tensor values for different surface related
defects visualized in Fig. 5. The value experimentally observed for a μc-SiC powder spectrum and
the corresponding angular averaged theoretical values g_av are also given

Defect            g_av      g1        g2        g3
Exp.              2.0073
(a) Si(001):H     2.00320   2.00187   2.00359   2.00415
(b) C(001):H      2.00292   2.00245   2.00273   2.00357
(c) Si(001) + H   2.00268   2.00238   2.00271   2.00294
(d) C(001) + N_C  2.00271   2.00180   2.00299   2.00333


norm-conserving pseudopotentials with a plane wave energy cutoff of 30 and 50 Ry
for silicon and SiC, respectively. The ab initio calculation of the g-tensor is still very
time-consuming: whereas a 2 × 2 × 1 Monkhorst-Pack (MP) [20] k-point set is
sufficient for the geometry optimisation, converged estimates for the g-tensor
generally require at least 4 × 4 × 1 samplings. In some cases the number of
k-points can be reduced by symmetry, but the number of (k+q)-points has to be
multiplied by a factor of 7 in order to obtain the derivatives in reciprocal space.
A second requirement for the calculation of the spin-currents in linear magnetic
response [9] is the calculation of the Green’s function (see Eq. 13). As a
result, the calculation of a g-tensor becomes computationally extremely demanding
and takes about an order of magnitude more CPU time than the structure optimization.
The hyperfine splittings, on the other hand, can be calculated on the fly: if the
structure optimization takes 1 day of CPU time and the g-tensor more than
10 days, the hf splittings are available after only 10 CPU minutes.


3 Results 

We first discuss the H vacancy at a Si(111):H surface as a reference system.
This structure is obtained by removing one H atom from an otherwise completely
hydrogenated Si(111) surface. It provides a simple model for a single paramagnetic
dangling bond. Figure 2 shows several views of the microscopic structure and the
corresponding magnetisation density m(r). As can be seen from the top view (lower
left corner), the structure shows perfect C3v symmetry with the symmetry axis along
the surface normal. An analysis of m(r) at the nuclei leads to the hyperfine splittings
given in Table 1. As intuitively expected, by far the largest hf splitting (-415 and
-220 MHz for the magnetic field along and perpendicular to the surface normal,
respectively) is due to the unsaturated Si atom at the surface. The right part of Fig. 2
shows the magnetization density in a plane parallel to the surface normal. It can be
considered a typical ‘textbook’ fingerprint of a dangling bond. As an additional
feature, weaker accumulations of m(r) are found along the Si zig-zag line. As a
result, besides that of the dangling bond nucleus itself, our ab initio calculations
predict a further characteristic hf splitting. With a value of about -43 MHz due to
three equivalent nuclei Si3 in the third layer, this hf splitting could be large enough
to be resolved in EPR measurements. In contrast, the hf splittings below 10 MHz
(see Table 1), especially those of the H atoms at the Si(111) surface (below 2 MHz),
are too small to be resolved. They will contribute to the width of the central line
instead. The position of this central line is determined by the g-tensor.

In Table 2, the calculated principal g-values are compiled for different k-point
samplings. For the 4 × 4 × 1 and larger samplings the values can be
considered converged. The vanishing angle θ between g3 and the surface normal
confirms again the perfect alignment of the dangling bond along the surface normal.
Perpendicular to the surface normal, comparatively large g-values are predicted,
with g1 = g2 = 2.00939. The g-value parallel to the surface normal remains similar
to that of the free electron. The reason for this particular anisotropic shape is the
perfect alignment of the dangling bond along the surface normal, resulting in a
strongly anisotropic spin-orbit coupling similar to the Rashba type [5], characterized
by vanishing spin-orbit coupling for the spin along the surface normal.
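
The relation between the principal values, the angular average and the Δg-values in ppm can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the principal values are those of the converged 4 × 4 × 1 row of Table 2, and g_e is the free electron value quoted earlier in the text.

```python
# Free electron g-value as quoted in the text.
G_FREE_ELECTRON = 2.002319278


def angular_average(principal_values):
    """Isotropic (powder) g-value: mean of the three principal values."""
    g1, g2, g3 = principal_values
    return (g1 + g2 + g3) / 3.0


def delta_g_ppm(g_value):
    """Deviation from the free electron value in parts per million."""
    return (g_value - G_FREE_ELECTRON) * 1.0e6


# Converged principal values for the H vacancy at Si(111):H (Table 2).
g_principal = (2.00939, 2.00939, 2.00111)
g_av = angular_average(g_principal)
shifts_ppm = [delta_g_ppm(g) for g in g_principal]
```

With these inputs the average reproduces the tabulated g_av of about 2.00663, and the shifts agree with the Δg-values listed for the (111) surface in Table 3 to within the rounding of the five-decimal principal values.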

Fig. 2 Microscopic structure and magnetisation density of the paramagnetic H vacancy at the
Si(111):H surface for different side views (lower left corner: top view). Plot of the magnetisation
density within a plane including the surface normal. The arrows describe the direction of the
magnetization density (i.e. the principal axis of the hf tensor) at the Si nuclei

We also take this textbook dangling bond as an example to evaluate the
computational efficiency of our code in massively parallel applications. We analyse the
hydrogen vacancy at the H-terminated Si(111):H surface in a 119 atom supercell
using a 6 × 6 × 1 k-point set. Due to the trigonal C3v symmetry this results in an
explicit treatment of 24 k-points. In the standard mode, the QUANTUM-ESPRESSO
code parallelizes over reciprocal lattice vectors. For fast Fourier
transforms (FFT) and a limited number of processors, this simple
treatment already often provides good scaling with the number of quasi-processors (cores)
used. However, especially in cases where the number of lattice vectors does not fit
the number of cores, and for more than 32 cores, the performance of the calculations
increasingly saturates (see Fig. 3). We cannot exclude that the observed
saturation is also promoted by the additional communication between the nodes
that is unavoidable for N_cores > 32. However, for band structure calculations, and in
general for calculations with a large number of k-points, a second parallelization via pools
can be established. Here, the k-points are divided into pools, whereby each pool
contains a subset of the k-points. In this way, for the present architecture we obtain
linear scaling for up to 256 processors (16 nodes).
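
The idea of dividing the k-points into pools can be illustrated with a short Python sketch. It is not the distribution scheme of the QUANTUM-ESPRESSO code itself: the round-robin assignment, the requirement that the cores divide evenly among the pools, and both function names are assumptions chosen to keep the example minimal.

```python
def distribute_kpoints(n_kpoints, n_pools):
    """Assign k-point indices to pools in round-robin order (assumed scheme)."""
    if not 1 <= n_pools <= n_kpoints:
        raise ValueError("number of pools must lie between 1 and the number of k-points")
    return [list(range(pool, n_kpoints, n_pools)) for pool in range(n_pools)]


def cores_per_pool(n_cores, n_pools):
    """Cores available to each pool, assuming the cores divide evenly among pools."""
    if n_cores % n_pools != 0:
        raise ValueError("number of cores must be a multiple of the number of pools")
    return n_cores // n_pools


# Example with the numbers quoted in the text: 24 k-points, 256 cores, 8 pools.
pools = distribute_kpoints(24, 8)
cores_in_each_pool = cores_per_pool(256, 8)
```

In this example every pool handles three k-points on 32 cores, so the parallelization over reciprocal lattice vectors inside a pool stays within the range where it was found to scale well.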

3 For spin-polarized calculations, the second spin channel is realized by doubling the k-point set.

Fig. 3 CPU time on the HLRS CRAY XE6 for self-consistent field calculations (blue) and
g-tensor calculations (red). For an exact description of the parameters used see text. The
number of pools is marked by the symbols plus, cross, star, and square, corresponding to 1, 2,
4 and 8 pools, respectively

The number of pools is obviously limited by the number of k-points in the
calculation and, more critically, by the memory consumption per core: doubling the
number of pools increases the required memory per core by a factor of 1.2-1.4.
The best scaling, however, is obtained for metallic systems with several thousands of
k-points. On the other hand, for a small number of cores per pool the parallelization
via k-points can become counter-productive. For a given system size (including cell
size, number of electrons/bands, number of k-points) and for a chosen number of
cores an optimal number of pools exists, which is mostly larger than the available memory
allows. In other words, the overall limiting factor is the memory available
per core. For future work, especially for larger systems with up to a thousand
atoms, an architecture with more than 2 GB RAM per core would therefore be desirable.
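
To see how quickly the memory limit is reached, the growth factor of 1.2-1.4 per doubling of the pools can be turned into a small estimate. The sketch below is purely illustrative: the base requirement of 1 GB per core for a single pool, the list of candidate pool counts and the function names are hypothetical and not taken from the calculations reported here.

```python
import math


def memory_per_core(base_gb, n_pools, growth_per_doubling):
    """Memory per core if each doubling of the pools multiplies it by a fixed factor."""
    return base_gb * growth_per_doubling ** math.log2(n_pools)


def largest_affordable_pools(base_gb, limit_gb, growth_per_doubling,
                             candidates=(1, 2, 4, 8, 16)):
    """Largest candidate pool count whose memory per core stays within the limit."""
    affordable = [p for p in candidates
                  if memory_per_core(base_gb, p, growth_per_doubling) <= limit_gb]
    return max(affordable, default=0)


# Hypothetical base requirement of 1 GB per core with one pool, 2 GB available.
pools_optimistic = largest_affordable_pools(1.0, 2.0, 1.2)
pools_pessimistic = largest_affordable_pools(1.0, 2.0, 1.4)
```

Under these assumed numbers a 2 GB limit allows 8 pools for a growth factor of 1.2 but only 4 pools for a factor of 1.4, which illustrates why the memory per core rather than the number of k-points tends to set the limit.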

Coming back to the physical results of our calculations, the situation becomes
more difficult in the case of the Si(001) surface, which has an appreciable history of both
experimental and theoretical work (see e.g. Ref. [21] for a review). It shows the
famous 2 × 1 reconstruction into rows of buckled Si dimers. The left part of Fig. 4
shows a side view along these dimer rows. The adsorption of a single H atom
breaks the double bond of a dimer. The result is a single dangling bond (second row
from the left in Fig. 4). If further H atoms are adsorbed at the Si(001) surface, either
further dimers are broken or existing dangling bonds are saturated. In the case of
complete saturation, each Si atom at the surface bonds one H atom. In Table 3,
the EPR parameters for such microstructures are compared. Obviously, the g-tensor
varies strongly with the hydrogen coverage. In particular, the Δg3 values along the
principal axis of the g-tensor differ by more than one order of magnitude. In this
sense, the g-tensor is far more characteristic than the hf splittings, which vary only
within 20 %. Since, in contrast to the hf splittings, the sign of Δg can be determined
experimentally, this also holds in the case of powder spectra, where only the isotropic
part g_av is available from experiment.

Fig. 4 Structure and magnetisation densities for a single adsorbed H-atom (left) and for a nearly
complete H-coverage of the Si(001) surface (H vacancy at monohydride surface, right)

The results and experience obtained for the H-terminated silicon surfaces give
us confidence that our method is accurate enough to predict the spectroscopic
magnetic properties of dangling bonds in real devices, such as solar cells based on
microcrystalline SiC.

Microcrystalline silicon carbide (μc-SiC) has become an attractive new class
of advanced materials for light emitting diodes and heterojunction photovoltaic
devices [24]. Here, the microcrystallites are of interest as effective charge carrier
collectors in organic solar cells. When a photon is absorbed by an organic
photoactive material, an exciton, i.e., a bound state of an electron and a hole, is
created. Due to the notably short exciton lifetime of several tens of nanoseconds,
the most important design criterion for such solar cells is to make the thickness
of the individual layers smaller than the diffusion length of the exciton [22, 23].
For this purpose, fullerene molecules are usually dispersed in a polymer matrix.
One possibility to further increase the efficiency of this kind of solar cell is to use
alternative acceptor materials with a more suitable position of the LUMO (lowest
unoccupied molecular orbital) level. By adjusting the latter, the open circuit voltage
of the solar cell can be increased at the expense of the energy lost by the electrons
in connection with the charge transfer from donor to acceptor material. Instead of
rather expensive fullerenes, wide bandgap micro- and nano-crystals are currently
under discussion as acceptor materials, such as TiO2 [25], ZnO [26], and last but
not least SiC [27].

Usually, the mentioned materials present unavoidable technical difficulties, such
as intrinsic defects, the lack of suitable p-doping [28] or omnipresent n-doping. As
a promising alternative, a sol-gel growth process for μc-SiC has been
proposed in Ref. [31]. The resulting material is almost free from the usually unavoidable
nitrogen donors, allowing arbitrary doping. In Ref. [32] electron paramagnetic resonance
(EPR) was used as an analytic tool to control the doping success: doping
with N, Al, and P leads to different, characteristic EPR spectra. They are clearly
different from those known for the usual shallow donors and acceptors in bulk SiC.
The nominally undoped and nitrogen-doped samples show an EPR line similar to
a center discovered in porous 3C-SiC, assigned to a carbon dangling bond at the
3C-SiC/SiO2 interface. The g-factor is slightly different, but the line half width
is almost the same. At first view, this similarity is very surprising since the
microcrystallites are not oxidized: even if the crystals become as large as 20 μm in
diameter, they do not show the typical SiO2 lines in nuclear magnetic resonance and
infrared reflection spectroscopy. Electrical and photoluminescence measurements
support the finding that the required acceptor behavior of μc-SiC is caused by
surface-related defects in combination with an appropriate position of the Fermi
level, which is determined by doping.

Based on the experimental results, the microscopic structure of the responsible
defect at the clean surface of the microcrystallites and its influence on
the charge-separation mechanism are discussed with the help of ab initio calculations.
In order to elucidate the microscopic origin of the observed EPR signals, we
calculate the EPR parameters for some possible dangling-bond related structures
and compare them with the experimental values. For the modelling of the surfaces we
use the settings already mentioned in connection with the partially hydrogenated
silicon surfaces. In close analogy to the (001)-oriented silicon surfaces we first
discuss the different defects at the corresponding surface of 3C-SiC (see Fig. 5).
In silicon, the clean (001) surface is stabilized by rows of buckled dimers, which
survive if the surface is partially hydrogenated. Only if nearly all silicon
atoms are mono-hydrogenated do the dimers at the reconstructed surface lose their
buckling. In 3C-SiC, the silicon-terminated Si-surface looks similar at first view. But
due to the smaller bond length in SiC the buckling becomes less efficient, resulting
in almost all cases in surfaces with very complicated reconstructions [29, 30]
showing metallic and diamagnetic properties. Hence, these configurations cannot
be responsible for the observed paramagnetic structures.

Only an almost completely hydrogen-passivated configuration with one missing
hydrogen (shown in Fig. 5a) leads to a paramagnetic surface state. In the following,
we focus on the carbon-terminated C-surface, where the situation is far more
straightforward. Here, similar to the case of diamond, the clean surface already
exhibits a 2 × 1 dimer reconstruction without buckling (see also Fig. 5b-d). All
these surface-related defects provide paramagnetic states and are thus possible
candidates to explain the g = 2.0073 EPR signal of unknown origin. In Table 4,
the calculated elements of the corresponding g-tensors are listed. Since no angular
dependent EPR measurements were possible for the microcrystalline SiC powder,
the only value that can be compared to the experiment is g_av, the g-value averaged
over all possible orientations. In all cases, Si-related hyperfine splittings in the range
20-30 MHz can be found, but the calculated average g-values g_av are far away
from the experimental data. So, the corresponding models have to be discarded as
an explanation for the surface-related EPR signal.
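
How far the calculated averages lie from the measured value can be quantified directly from Table 4. The following Python sketch does this; the dictionary keys simply reuse the panel labels (a)-(d) of the table, and the names `G_EXPERIMENT` and `deviation_ppm` are introduced here for illustration only.

```python
# Experimental powder value and calculated angular averages from Table 4.
G_EXPERIMENT = 2.0073
G_AV_CALCULATED = {
    "(a)": 2.00320,
    "(b)": 2.00292,
    "(c)": 2.00268,
    "(d)": 2.00271,
}


def deviation_ppm(g_calculated, g_experiment=G_EXPERIMENT):
    """Difference between calculated and measured g-value in ppm."""
    return (g_calculated - g_experiment) * 1.0e6


deviations_ppm = {label: deviation_ppm(g) for label, g in G_AV_CALCULATED.items()}
closest_model = min(deviations_ppm, key=lambda label: abs(deviations_ppm[label]))
```

Even the closest model, defect (a), misses the experimental value by roughly 4,100 ppm, which is of the same size as the entire calculated Δg_av of the Si(111) dangling bond in Table 3.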

Fig. 5 Magnetization density of different surface-defects at the (001) surface of 3C-SiC.
(a) Si-terminated surface, hydrogen passivated but with one missing H-atom Si(001)-H, (b) the
corresponding defect at the C-terminated surface C(001)-H, (c) C-terminated surface with one
H-atom adsorbed C(001)+H, (d) C-terminated surface with a substitutional nitrogen atom on a
C-site C(001)+N_C

Fig. 6 Calculated band structure of 3C-SiC bulk (left) and for a 2 × 1 dimer-reconstructed SiC
(001) C-surface (middle). The remarkable reduction of the fundamental gap (shaded grey) by the
surface leads to an efficient auto-doping. Additional localized defects at the surface (right) induce
additional defect levels (blue) and give rise to a spin-polarized band structure, but do not change
the situation with respect to the LUMO considerably

Nevertheless, it is worthwhile to check the energy levels of the surface-related
defects. For this, we have to analyse the corresponding electronic band structures.
Figure 6 (middle) shows the band structure of the clean C-terminated (001) surface,
calculated using the gradient-corrected PBE functional. In comparison with the
SiC bulk material (left part), additional broad bands appear in the gap. Those in
the lower part are occupied, covering a 1.2 eV broad region. The unoccupied bands
overlap with the conduction bands of the bulk material, so that the position of the
LUMO is lowered. In total, the fundamental gap is considerably reduced to 0.2 eV.
Bearing in mind the well-known underestimation of band gaps by the local density
approximation (LDA) or gradient-corrected functionals like PBE (e.g. the
fundamental gap of 3C-SiC comes out at about 1.4 eV while the experimental value is
2.4 eV), the experimental values should be larger. To determine the exact positions
of the band edges, more elaborate calculations (e.g. hybrid functionals) would
be necessary. Nevertheless, the LDA prediction of a gap considerably reduced
from both the valence band and the conduction band side can be considered
at least qualitatively correct. The result is something like an auto-doping of the
3C-SiC microcrystals. The additional unpaired electron, introduced by a paramagnetic
surface structure, leads only to an additional energy level clearly below the
highest occupied surface state, but leaves all other features of the band structure
of the clean surface unchanged. In all cases, the gap is significantly reduced, and
the mere existence of the surface leads to an efficient auto-doping. In other words,
independent of the details of the paramagnetic defect structure, even the nominally
undoped microcrystalline sol-gel 3C-SiC behaves as an efficient acceptor for
charge carrier collection.
4 Conclusions

Ab initio calculations of the electronic g-tensor of paramagnetic states at surfaces
and in solar cells are presented. After discussing the numerical requirements for such
calculations, we show that for silicon surfaces the g-tensor varies critically with
the hydrogen coverage and is far more characteristic than the hf splitting of the
Si dangling bonds or the adsorbed H atoms. This also holds in the case of powder
spectra, where only the isotropic part g_av is available from experiments. Extending
our calculations to micro- and nano-crystalline 3C-SiC as a basic material for
solar cells, our study shows that sol-gel grown material can serve as an excellent
acceptor material for effective charge separation in organic solar cells. It fits
excellently into the energy level scheme of this kind of solar cell and has the
potential to replace the commonly used, rather expensive fullerenes. It turned out that
even undoped μc-3C-SiC acts as a particularly suitable acceptor due to its
auto-doping mechanism by a surface-induced band structure.
Ab-Initio Calculations of the Vibrational
Properties of Nanostructures


Gabriel Bester and Peng Han 


1 Introduction 

Colloidal semiconductor nanocluster research is a rapidly growing field driven by
the attractive idea of tailoring material properties by acting on the morphology of
the structures. The modification of the optical properties by merely changing the
diameter of colloidal quantum dots is one of the figureheads of nanostructure science
[1-3]. It is the intense research effort towards the fabrication of nanostructures with
favorable properties that has helped to establish most of the knowledge base we
rely on today. By now, the modification of the electronic and optical properties
by changing the size of the nanoclusters is well understood theoretically and
well controlled experimentally. One open problem of nanostructure science is the
effect of temperature on the electronic and optical properties of nanoclusters, and
hence their vibrational properties. A theory at T = 0 K yields very valuable results
that unveil certain aspects of the underlying physics, but to make predictions valid
in the real world, where effects such as temperature broadening,
quantum coherence dephasing, spin-flip transitions and relaxation of charge carriers
are key components [4-6], the effects of vibration and temperature on the dynamical
processes must be addressed.

The vibrational properties of bulk semiconductors, such as the phonon density of
states (DOS) and dispersion, have been calculated with great accuracy using
ab initio density functional perturbation theory (DFPT) since the end of the last
century [7]. After the successful application of DFPT to bulk phonon eigenmodes,
ab initio studies of the vibrational properties of semiconductor nanostructures such
as fullerenes, nanowires, nanotubes, and small nanoclusters have been
performed [8-10]. However, an accurate density functional theory (DFT) study of
the vibrational properties of nanoclusters with the experimentally relevant size of
a few nm in diameter has not been reported until now due to the high computational
demand.

With the computational facility available at the Höchstleistungsrechenzentrum
Stuttgart (HLRS), we have recently calculated confinement and surface effects
on the vibrational properties of colloidal semiconductor nanoclusters based on
first-principles DFT. We describe how the molecular-type vibrations, such as
surface-optical, surface-acoustic, and coherent acoustic modes, coexist and interact
with bulk-type vibrations, such as longitudinal and transverse acoustic (LA, TA)
and optical (LO, TO) modes. We could link the vibrational properties to structural
changes induced by the surface and highlight the qualitative difference between
III-V and II-VI semiconductor nanoclusters [11]. We describe the specific heat
of nanoclusters at low temperature and link the thermodynamic properties to the
low frequency vibrational modes and the surface structure. We suggest that the
low temperature specific heat should be a promising avenue for studying the surface
properties of nanoclusters. Since nanoclusters are believed to have only a certain
fraction of their surface atoms directly passivated by ligand atoms, we study
the effects of the removal of passivants and the reconstruction of the surface on
the vibrational properties. We attribute the strong modification of the vibrational
properties to the transformation from sp³ to sp² bonds.


2 Computational Methods 
2.1 Research Methodology 

A detailed review of density functional theory applied to lattice-dynamical
calculations has been given elsewhere [7]. Here, we briefly outline our research
methodology. Based on the adiabatic approximation (Born-Oppenheimer
approximation), the lattice-dynamical properties of a system are given by

$$ \left( -\sum_{I} \frac{\hbar^2}{2 M_I} \frac{\partial^2}{\partial \mathbf{R}_I^2} + E(\{\mathbf{R}\}) \right) \Phi(\{\mathbf{R}\}) = \varepsilon\, \Phi(\{\mathbf{R}\}) \qquad (1) $$

where R_I is the coordinate of atom I, M_I its mass, {R} the nuclei configuration
given as a set of atomic positions, ħ the reduced Planck constant, and ε and Φ({R})
the eigenvalue and eigenvector of the lattice vibrations, respectively. E({R}) is the
ground state energy of the system, which is determined by the many body Hamiltonian

$$ H_{BO}(\{\mathbf{R}\}) = -\frac{\hbar^2}{2m} \sum_{i} \frac{\partial^2}{\partial \mathbf{r}_i^2} + \frac{e^2}{2} \sum_{i \neq j} \frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} + V_{\mathbf{R}}(\mathbf{r}) + V_N(\{\mathbf{R}\}) \qquad (2) $$

where m is the mass of the electron, e the electron charge, and r_i the coordinate of
electron i. The electron-nucleus interaction potential V_R(r) is given by


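$$ V_{\mathbf{R}}(\mathbf{r}) = -\sum_{i,I} \frac{Z_I e^2}{|\mathbf{r}_i - \mathbf{R}_I|} \qquad (3) $$

where Z_I represents the charge of nucleus I. The electrostatic interaction
potential V_N({R}) is written as

$$ V_N(\{\mathbf{R}\}) = \frac{e^2}{2} \sum_{I \neq J} \frac{Z_I Z_J}{|\mathbf{R}_I - \mathbf{R}_J|}. \qquad (4) $$
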

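Based on the Hellmann-Feynman theorem, the force acting on nucleus I is

$$ \mathbf{F}_I = -\frac{\partial E(\{\mathbf{R}\})}{\partial \mathbf{R}_I} = -\left\langle \Psi(\mathbf{r},\{\mathbf{R}\}) \left| \frac{\partial H_{BO}(\{\mathbf{R}\})}{\partial \mathbf{R}_I} \right| \Psi(\mathbf{r},\{\mathbf{R}\}) \right\rangle \qquad (5) $$

and the force constant matrix elements are

$$ \frac{\partial^2 E(\{\mathbf{R}\})}{\partial \mathbf{R}_I \partial \mathbf{R}_J} = \int \frac{\partial \rho_{\mathbf{R}}(\mathbf{r})}{\partial \mathbf{R}_J} \frac{\partial V_{\mathbf{R}}(\mathbf{r})}{\partial \mathbf{R}_I}\, d\mathbf{r} + \int \rho_{\mathbf{R}}(\mathbf{r}) \frac{\partial^2 V_{\mathbf{R}}(\mathbf{r})}{\partial \mathbf{R}_I \partial \mathbf{R}_J}\, d\mathbf{r} + \frac{\partial^2 V_N(\{\mathbf{R}\})}{\partial \mathbf{R}_I \partial \mathbf{R}_J} \qquad (6) $$

where Ψ(r, {R}) is the electronic ground-state wave function and ρ_R(r) the electron
charge density for the nuclei configuration {R}. The charge density ρ_R(r) is
obtained by mapping the problem onto a set of one-particle equations (Kohn-Sham
equations):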


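$$ \left[ -\frac{\hbar^2}{2m} \frac{\partial^2}{\partial \mathbf{r}^2} + V_{\mathbf{R}}(\mathbf{r}) + e^2 \int \frac{\rho_{\mathbf{R}}(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}\, d\mathbf{r}' + \frac{\delta E_{xc}}{\delta \rho_{\mathbf{R}}(\mathbf{r})} \right] \psi_n(\mathbf{r}) = \varepsilon_n \psi_n(\mathbf{r}) \qquad (7) $$

$$ \rho_{\mathbf{R}}(\mathbf{r}) = 2 \sum_{n=1}^{\mathrm{occ.}} |\psi_n(\mathbf{r})|^2 \qquad (8) $$

where E_xc is the exchange-correlation energy, and ε_n and ψ_n(r) are the eigenenergy
and wave function of the electronic states, respectively.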

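Based on the harmonic approximation of lattice dynamics, the frequencies ω and
the corresponding eigenmodes u_I are obtained by solving the eigenvalue equation

$$ \sum_{J} \frac{1}{\sqrt{M_I M_J}} \frac{\partial^2 E(\{\mathbf{R}\})}{\partial \mathbf{R}_I \partial \mathbf{R}_J}\, \mathbf{u}_J = \omega^2 \mathbf{u}_I. \qquad (9) $$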
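
The structure of Eq. (9) can be made concrete with the smallest possible example, two atoms on a line joined by a single harmonic spring. The Python sketch below builds the mass-weighted matrix and diagonalises it in closed form; the toy force constants, the masses in arbitrary units and the function names are assumptions for illustration and have nothing to do with the nanoclusters studied here.

```python
import math


def dynamical_matrix(force_constants, masses):
    """Mass-weight the force constants: D_IJ = Phi_IJ / sqrt(M_I * M_J)."""
    n = len(masses)
    return [[force_constants[i][j] / math.sqrt(masses[i] * masses[j])
             for j in range(n)] for i in range(n)]


def eigenvalues_2x2(matrix):
    """Eigenvalues (ascending) of a real symmetric 2x2 matrix, in closed form."""
    a, b, c = matrix[0][0], matrix[0][1], matrix[1][1]
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius


# Toy system: one spring of strength k between two atoms (arbitrary units).
k = 2.0
masses = [1.0, 3.0]
phi = [[k, -k], [-k, k]]
omega2_translation, omega2_stretch = eigenvalues_2x2(dynamical_matrix(phi, masses))
```

The lower eigenvalue vanishes because a rigid translation costs no energy, and the upper one equals k(1/M_1 + 1/M_2), the familiar reduced-mass result for a diatomic molecule. For a real cluster the same construction is applied to the full 3N × 3N matrix, with each mass repeated for the three Cartesian components.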


To analyze the eigenmodes in terms of core and surface contributions, we
calculate the projection coefficients

$$ \alpha^{\nu}_{c,s,p} = \frac{ \sum_{I}^{N_c, N_s, N_p} |\mathbf{X}^{\nu}(I)|^2 }{ \sum_{I=1}^{N} |\mathbf{X}^{\nu}(I)|^2 } \qquad (10) $$

where N_c, N_s, N_p and N are the core, surface, passivant, and total number of
atoms, and X^ν(I) represents the three components that belong to atom I from the
3N-component eigenvector. We define the surface atoms as the atoms belonging to the
outermost seven layers of the cluster (around 3 Å thick). From the phonon DOS
D(ω) we obtain the specific heat according to:

$$ C_v(T) = N_A k_B \int_0^{\infty} \left( \frac{\hbar\omega}{k_B T} \right)^2 \frac{ e^{\hbar\omega / k_B T} }{ \left( e^{\hbar\omega / k_B T} - 1 \right)^2 }\, D(\omega)\, d\omega. \qquad (11) $$
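
A numerical evaluation of Eq. (11) needs little more than a quadrature over the tabulated DOS. The following Python sketch uses the trapezoidal rule on a wavenumber grid; the flat model DOS, its normalisation to three modes per atom, the grid and the function names are all assumptions made for this illustration and are not the data or the procedure of the present work.

```python
import math

GAS_CONSTANT = 8.314462618      # N_A * k_B in J/(mol K)
HC_OVER_KB_CM_K = 1.438776877   # h*c/k_B in cm*K: wavenumber -> hbar*omega/(k_B*T)


def mode_weight(x):
    """x^2 e^x / (e^x - 1)^2, rewritten with e^-x to avoid overflow at large x."""
    if x < 1e-6:
        return 1.0
    decay = math.exp(-x)
    return x * x * decay / (1.0 - decay) ** 2


def specific_heat(wavenumbers, dos, temperature):
    """Trapezoidal estimate of C_v(T) for a DOS tabulated on a grid in cm^-1."""
    values = [mode_weight(HC_OVER_KB_CM_K * w / temperature) * d
              for w, d in zip(wavenumbers, dos)]
    integral = 0.0
    for i in range(len(values) - 1):
        step = wavenumbers[i + 1] - wavenumbers[i]
        integral += 0.5 * (values[i] + values[i + 1]) * step
    return GAS_CONSTANT * integral


# Hypothetical flat DOS between 1 and 400 cm^-1, normalised to 3 modes per atom.
grid = [float(w) for w in range(1, 401)]
flat_dos = [3.0 / (grid[-1] - grid[0])] * len(grid)
cv_room_temperature = specific_heat(grid, flat_dos, 300.0)
```

With this normalisation the high-temperature limit approaches 3 N_A k_B, the Dulong-Petit value, while at low temperature only the low frequency part of the DOS contributes, which is why the low temperature specific heat is sensitive to the surface modes.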


2.2 Computational Details 

The nanoclusters we studied are constructed by cutting a sphere, centered on a cation
with Td point group symmetry, from the zinc blende bulk structure and removing
the surface atoms having only one nearest-neighbor bond. The surface dangling
bonds are terminated by pseudohydrogen atoms H* with a fractional charge of 1/2,
3/4, 5/4, and 3/2 for group VI, V, III, and II atoms, respectively. The calculations
are performed using the local-density approximation (LDA) and Troullier-Martins
norm-conserving pseudopotentials with an energy cutoff of 30 Ry for III-Vs and 40 Ry for
II-VIs.
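
The fractional charges quoted above follow from simple electron counting: in a tetrahedral environment an atom with Z_v valence electrons contributes Z_v/4 electrons to each bond, so a passivant must supply the remaining (8 - Z_v)/4 to complete the two-electron bond. A minimal Python sketch of this rule, with a hypothetical function name, reads:

```python
from fractions import Fraction


def pseudohydrogen_charge(valence_electrons: int) -> Fraction:
    """Charge of the pseudohydrogen that completes a two-electron bond
    to a tetrahedrally bonded atom with the given number of valence electrons."""
    return Fraction(8 - valence_electrons, 4)


# Group II, III, V and VI atoms carry 2, 3, 5 and 6 valence electrons.
charges = {group: pseudohydrogen_charge(z)
           for group, z in (("II", 2), ("III", 3), ("V", 5), ("VI", 6))}
```

Evaluating the rule for the four groups reproduces the values 3/2, 5/4, 3/4 and 1/2 listed in the text.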

The geometry relaxation is performed using the Broyden-Fletcher-Goldfarb-Shanno
(BFGS) procedure for the optimization of the ionic positions. The forces
are minimized to less than 3 × 10⁻⁶ a.u. (5 × 10⁻⁴ a.u.) under constrained symmetry
for the passivated (unpassivated) nanoclusters. With the optimized geometry, the
dynamical matrix elements ∂²E/∂R_I∂R_J are obtained by solving Eq. (6). In the
calculation, the charge density ρ_R(r) is obtained by solving the Kohn-Sham equation
self-consistently, and the values of ∂ρ_R(r)/∂R_J are calculated using a finite difference
approach. In principle we need 3N atomic displacements to obtain all the elements
of the dynamical matrix (N being the number of atoms). In practice we calculate
a significantly lower number of displacements (3N/24) and use the symmetry
elements of the point group to deduce the missing elements. This is a key point
for being able to treat these large structures.
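
The finite difference idea can be sketched on a toy model in which the forces are known analytically. The Python example below displaces each coordinate of a short harmonic chain in both directions and differentiates the forces numerically; it is only an illustration of the principle. The chain, the step size, the use of forces instead of charge density derivatives and the function names are assumptions, and the symmetry reduction to 3N/24 displacements is not implemented.

```python
def spring_forces(positions, k=1.5, rest_length=1.0):
    """Forces on a 1D chain of atoms joined by identical harmonic springs."""
    forces = [0.0] * len(positions)
    for i in range(len(positions) - 1):
        stretch = positions[i + 1] - positions[i] - rest_length
        forces[i] += k * stretch
        forces[i + 1] -= k * stretch
    return forces


def finite_difference_force_constants(forces_fn, positions, step=1e-3):
    """Phi_IJ = -dF_I/dR_J from central differences, one coordinate at a time."""
    n = len(positions)
    phi = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus, minus = list(positions), list(positions)
        plus[j] += step
        minus[j] -= step
        f_plus, f_minus = forces_fn(plus), forces_fn(minus)
        for i in range(n):
            phi[i][j] = -(f_plus[i] - f_minus[i]) / (2.0 * step)
    return phi


# Three atoms at their equilibrium positions.
phi_chain = finite_difference_force_constants(spring_forces, [0.0, 1.0, 2.0])
```

For this harmonic model the numerical result coincides with the analytic force constant matrix. Each column costs two force evaluations, which is why reducing the number of independent displacements by symmetry matters so much for clusters with hundreds of atoms.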

All the calculations are performed with the CPMD code package developed at 
the Max Planck Institute in Stuttgart and at IBM in Zurich [12]. The CPMD code 
is a high performance parallelized plane wave/pseudopotential implementation of 
DFT. It offers, at the moment, the best scaling among the DFT codes using a hybrid 
scheme of MPI and OpenMP. 

In this project, all the calculations were carried out on the NEC Nehalem Cluster
at HLRS with 2.8 GHz processors, 12 GB memory per node, and InfiniBand interconnects.
The details of the scaling behavior and the performance per CPU for the CPMD code
have been given elsewhere [13].


3 Results 

3.1 Confinement Effects on Vibrational Properties 

We have calculated the vibrational properties for a total of 23 nanoclusters made
of InP, InAs, GaP, GaAs, CdS, CdSe, and CdTe. The wave function of the lowest
unoccupied molecular orbital (LUMO) state of a Ga₅₃₁As₅₃₂H*₁₂ nanocluster, with an
isosurface corresponding to 75 % of the maximum value, is presented in Fig. 1. In
this report, we extract the essence of these calculations and select representative
results. The vibrational DOS of InP and CdS nanoclusters along with the bulk
phonon DOS are plotted with a broadening of 0.8 cm⁻¹ in Fig. 2. Although formally
TA, LA, TO, and LO phonon modes cease to exist in a nanocluster, the comparison
with the bulk phonon DOS in Fig. 2e, j reveals an obvious bulk parentage. From
Fig. 2, we see that: (i) The III-V (InP) nanoclusters show a blueshift of the LO-, TO-,
and LA-derived cluster modes with decreasing size, while this blueshift cannot be
found in the II-VIs (CdS). (ii) The surface modes tend to completely fill the
acoustic-optic phonon gap in II-VIs but not in III-Vs. (iii) The “broadening” of the
bulk-like optical phonon branches induced by the confinement is larger for II-VIs than for
III-Vs.
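
A broadened DOS of the kind shown in Fig. 2 can be obtained from a discrete list of mode frequencies as sketched below. This is a minimal illustration, assuming a Gaussian line shape and interpreting the 0.8 cm⁻¹ broadening as the Gaussian standard deviation (the line shape is not specified here); the mode frequencies are hypothetical placeholders.

```python
import math

def broadened_dos(frequencies, grid, sigma=0.8):
    """Sum of unit-area Gaussians of width sigma (cm^-1), one per mode."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    dos = []
    for w in grid:
        total = 0.0
        for f in frequencies:
            total += norm * math.exp(-0.5 * ((w - f) / sigma) ** 2)
        dos.append(total)
    return dos

modes = [50.0, 52.0, 300.0]             # hypothetical mode frequencies in cm^-1
step = 0.1                              # grid spacing in cm^-1
grid = [i * step for i in range(4001)]  # 0 to 400 cm^-1
dos = broadened_dos(modes, grid)
n_modes = sum(dos) * step               # integrating the DOS recovers the mode count
```

Because each Gaussian has unit area, the integral of the DOS equals the number of modes, which is a convenient sanity check. Splitting the sum into core and surface contributions would only require weighting each mode by its core or surface character.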

These three effects can be understood from the geometry of the relaxed
nanoclusters. We plot the nearest-neighbor distances of relaxed III-V and II-VI
nanoclusters as a function of the distance of the respective bond to the cluster center
in Fig. 3. From this figure, we see that the bond length at the dot center is reduced
through the presence of the surface in all cases. The bond length distributions of
III-Vs and II-VIs exhibit qualitative differences. For III-Vs, the surface shells
show a successive reduction of bond length, going outward, while II-VIs show a
large bond length distribution. The overall reduction of bond length in III-Vs along
with the positive Grüneisen parameters explains the blueshift of the LO-, TO-, and
LA-derived cluster modes. The lack of shift in the TA modes stems from the small
negative Grüneisen parameter for this branch. Moreover, we attribute the broadening
of optical branches and the filling of the phonon gap in II-VIs to the
large bond length distribution.

Fig. 1 The wave function of the lowest unoccupied molecular orbital (LUMO) state
for a Ga₅₃₁As₅₃₂H*₁₂ nanocluster with isosurface corresponding to 75 % of the
maximum value. The colors blue and red give the phase of the wave function

Fig. 2 Vibrational density of states (DOS) contributed by core atoms (black) and
surface atoms (red) for (a)-(d) InP clusters, (e) bulk InP, (f)-(i) CdS clusters, and
(j) bulk CdS. Horizontal axis: frequency (cm⁻¹)

Fig. 3 Bond-length distribution as a function of the distance to the dot center
for (a) InP, (b) CdS, (c) III-Vs, and (d) II-VIs. LDA bulk bond lengths are
given as dashed lines

Fig. 4 Size-dependent low frequency vibrational modes for (a) InAs and (b) CdS. Lowest modes
with bulk character (circles), surface acoustic modes (crosses), lowest breathing modes (triangles),
and experimental results (Oron et al. [14] and Saviot et al. [18]) for the coherent acoustic modes
(diamonds). Lowest spheroidal mode according to the Lamb model (dashed-dotted line) and according
to the confined bulk model using the sound velocity of the TA branch (solid lines) and LA branch
(dashed lines)

After discussing the confinement effects on the high frequency vibrational
modes, we now focus on the low frequency modes. In Fig. 4, we plot the
size-dependent lowest frequencies, v/(4R), calculated from the transverse (solid curves)
and longitudinal (dashed curves) speed of sound v and the cluster radius R. The
circles and the crosses are the lowest core and surface acoustic modes obtained
from the DFT calculations. We see that the lowest core modes closely follow the
analytic 1/R dependence, while the surface acoustic modes are strongly affected
by the morphology of the surface and are not monotonic with cluster size.
Another important type of vibrational modes are the so-called coherent acoustic
modes, in which all the atoms vibrate in phase. The coherent phonon modes have
been observed with Raman spectroscopy, far-infrared absorption spectroscopy, and
resonant high-resolution photoluminescence spectroscopy, and are now the center of
attention when the manipulation of spins and the spin dephasing are investigated [14-16].
We plot our results as triangles along with the experimental data as diamonds
and the results from the Lamb model as a dashed-dotted line. Our results are in
good agreement with the experimental results (although our cluster sizes are still
somewhat smaller than in experiment in the case of InAs) and with the simple Lamb
model.
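
The confined-bulk estimate v/(4R) is simple enough to evaluate directly. The sketch below is only an illustration: the speed of sound (2000 m/s) and the radii are hypothetical placeholder values, not data from this work, and the function name is an assumption.

```python
C_CM_PER_S = 2.99792458e10  # speed of light in cm/s, converts Hz to cm^-1

def lowest_confined_wavenumber(sound_speed_m_s, radius_nm):
    """Lowest confined-bulk frequency v/(4R), returned as a wavenumber in cm^-1."""
    radius_m = radius_nm * 1.0e-9
    frequency_hz = sound_speed_m_s / (4.0 * radius_m)
    return frequency_hz / C_CM_PER_S

# Placeholder speed of sound; doubling the radius halves the frequency (1/R law).
for radius in (1.0, 2.0, 4.0):
    print(radius, lowest_confined_wavenumber(2000.0, radius))
```

Using the transverse or the longitudinal speed of sound in this expression gives the two confined-bulk curves, both of which scale as 1/R.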


3.2 Passivated versus Unpassivated Nanoclusters 

In real-world applications, the nanoclusters are believed to have only a certain
fraction of their surface atoms directly passivated by ligand atoms [17]. Thus, we
studied the extreme situation in which the nanoclusters are unpassivated [19]. In Fig. 5,
we compare the vibrational DOS of a fully passivated nanocluster with geometry
optimization (a) and of an unpassivated nanocluster without (b) and with (c) geometry
relaxation. By removing the passivants while freezing the atomic positions, we effectively
create dangling bonds in the sp³ hybrid orbitals. In this frozen geometry, the orbitals
cannot redistribute effectively and the bonds close to the surface are weakened. This
leads to a red shift of the modes with surface character. Once the geometry is
optimized, the dangling bonds tend to transform from sp³ to sp² hybrid orbitals and
strengthen. This leads to a blue shift of the vibrational modes with a magnitude even
greater than in the fully passivated cluster and, in general, to a “broadened” vibrational
DOS. To understand this effect, we plot in Fig. 6 the relaxed unpassivated cluster (a)
and the nearest neighbor distances of a relaxed passivated and a relaxed unpassivated
InP nanocluster as a function of the distance of the respective bond to the cluster
center (b). From Fig. 6b, we see that the bond length reduction close to the surface
is more significant in the case of the unpassivated cluster, and this results in the blue
shift. Moreover, the variation in bond lengths is significantly
larger in the unpassivated cluster, which leads to the broadening of the vibrational
DOS.


Fig. 5 Vibrational DOS of InP nanoclusters (a) with passivant, (b) unpassivated without
relaxation and (c) unpassivated with relaxation. Horizontal axis: frequency (cm⁻¹)

Fig. 6 (a) Relaxed geometry of the unpassivated In₃₂₁P₃₁₂ nanocluster. (b) Bond length distribution
as a function of the distance to the dot center for In₃₂₁P₃₁₂H*. The LDA bulk bond length is
given as dashed lines


3.3 Thermodynamic Properties

Following the discussion of the vibrational properties, we now turn to the
thermodynamic properties of nanoclusters. The vibrational specific heat C_V(T) is calculated
using Eq. (11) with the vibrational DOS from DFT computations. Different aspects
of the results are summarized in Fig. 7.

In Fig. 7a, we plot C_V(T)/T for CdSeH* nanoclusters of different sizes as
a function of T². With this choice of axes, the Debye T³ regime would appear
linear. We identify two distinct regions, region 1 below 24 K² and region 2 above
this value. In region 1, C_V(T)/T shows a strongly non-linear dependence,
which changes to a nearly linear behavior in region 2. Most surprisingly,
we find that the smaller nanoclusters have a larger specific heat than the larger ones.
To understand this behavior, we plot in Fig. 7b the low frequency vibrational
spectrum along with the percentage of surface character. Temperature region
1 roughly corresponds to the frequency region below 30 cm⁻¹ in (b), and the small
nanoclusters have more vibrational modes (higher vibrational DOS) than the larger
ones in this frequency region, and these modes have surface character. We can
see that the surface modes move up in frequency with increasing cluster size and
attribute this effect to the atomic configuration on a curved surface. As depicted in
the inset of Fig. 7a, the stronger curvature of smaller dots leads to an “open” surface
that allows for softer surface modes. From Fig. 7b, we see that the vibrational modes
with core character contribute to the specific heat when the temperature reaches
region 2. These vibrational modes show a red shift with decreasing cluster size
due to the negative Grüneisen parameter along with the contraction of the surface.

Fig. 7 Specific heat C_V divided by the temperature T as a function of T² for different CdSe
nanoclusters. In (b) the low frequency eigenmodes are drawn as vertical lines for three different
CdSe nanoclusters. The solid circle gives the percentage of surface character of each mode. In
(c) we estimated the slopes in the (nearly) linear regime of panel (a) for CdSe nanoclusters,
changing the passivant mass. In (d) we report the corresponding onset temperature of C_V

These bulk-like contributions are reflected in the nearly linear behavior in Fig. 7a.
We estimated the slopes and the onset temperatures of the curves and plotted them
in Fig. 7c, d with the passivant mass set to that of hydrogen, carbon, oxygen and phosphorus.
The slope decreases with increasing size in all cases and increases with increasing
mass of the passivant. This behavior can be understood from the reduced frequency with
increasing passivant mass. From Fig. 7d, we find that the smaller nanoclusters
have an earlier onset than the larger ones. This is again related to the surface mode
softening we reported for decreasing cluster size.
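
The way a set of vibrational frequencies translates into a C_V(T)/T versus T² plot can be sketched as follows. Eq. (11) is not reproduced in this section, so the sketch assumes the standard harmonic expression, in which each mode of wavenumber ν̃ contributes k_B x² eˣ/(eˣ − 1)² with x = hcν̃/(k_B T). The mode list is a hypothetical placeholder and the function names are assumptions.

```python
import math

C2 = 1.438776877  # second radiation constant hc/k_B in cm*K

def mode_heat_capacity(wavenumber, temperature):
    """Heat capacity of one harmonic mode in units of k_B (assumed standard form)."""
    x = C2 * wavenumber / temperature
    if x > 500.0:
        return 0.0  # mode is frozen out; also avoids overflow in exp
    ex = math.exp(x)
    return x * x * ex / (ex - 1.0) ** 2

def vibrational_cv(wavenumbers, temperature):
    """Total vibrational heat capacity in units of k_B."""
    return sum(mode_heat_capacity(w, temperature) for w in wavenumbers)

modes = [12.0, 18.0, 25.0, 40.0, 60.0]  # hypothetical low frequency modes in cm^-1
# Points (T^2, C_V/T) as used for the axes of Fig. 7a.
curve = [(t * t, vibrational_cv(modes, t) / t) for t in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)]
```

In this picture, a cluster with more modes at low wavenumber has a larger low temperature specific heat, since those modes are the first to be thermally populated.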


4 Summary 


In summary, we have performed ab initio DFT calculations to study the confinement
and surface effects on the vibrational and thermodynamic properties of III-V
and II-VI colloidal nanoclusters with up to a thousand atoms. We can identify the
following confinement and surface effects. (i) The LA-, TO- and LO-derived cluster
modes of III-V clusters blue shift significantly with decreasing cluster size. For
II-VI clusters this shift is absent, but the broadening of bulk-derived modes is
significant and the gap between optical and acoustic phonons is filled by surface
modes. (ii) We can clearly ascribe these observations to the large relaxation of the
clusters, dominated by an inward relaxation of the surface penetrating deep inside
the cluster in the case of the III-Vs and by a large distribution of bond lengths at the
surface of the II-VIs. These strong confinement effects tend to disappear for clusters with
more than 1,000 atoms. (iii) We find surface optical modes in the phonon gap and surface
acoustic modes as the lowest frequency modes. The coherent acoustic phonons are
identified and found to be in good agreement with results from the Lamb model and
experiment. (iv) In the unpassivated nanoclusters, the unpaired electrons in the sp³
hybrid orbitals reduce the bond strength and result in a red shift in the frozen geometry.
Once the geometry is optimized, the dangling bonds tend to transform from sp³ to
sp² hybrid orbitals, which leads to a blue shift of the vibrational frequencies. (v) The low
temperature specific heat reflects the surface acoustic vibrational modes, and we
suggest studying the low temperature specific heat to access the surface properties.
Part III
Reacting Flows

Prof. Dr. Dietmar Kröner


Two contributions in this section about Reacting Flows impressively show how
the results obtained by high performance computing can be used for the design
of realistic combustors in order to increase the efficiency or to avoid damage.
Furthermore, in two other contributions parameter studies for flames are performed
and the results are compared with available experimental data.

In the contribution “Conservative Implementation of LES-CMC for Turbulent
Jet Flames” by P. Siwaborworn and A. Kronenburg, the authors carry out
parametric studies for turbulent jet flames, in particular for the so-called Sandia D
flame, for which experimental results are available. The method used for this purpose is a
combination of large eddy simulation (LES) and conditional moment closure
(CMC) to account for turbulence-chemistry interactions. Two studies are carried out. One
deals with the combustion model and investigates the impact of a so-called
FDF-weighting function in the governing equations, the absence of which is believed to be
the main reason for inaccurate predictions. The other deals with the impact of the CMC grid
resolution. The study demonstrates that inclusion of the FDF-weighting function
leads to a more accurate solution. The simulations have been performed on the NEC
Nehalem Cluster with up to 720 cores. The parallel efficiency falls below 40 % if
more than 180 cores are used.

In “Numerical Investigation of a Complete Scramjet Demonstrator Model
for Experimental Testing under Flight Conditions” by Y. Simsont, P. Gerlinger and
M. Aigner, a scramjet model has been investigated numerically. The simulation
corresponds to a realistic scramjet model which has been tested for wind tunnel
experiments at Mach 8. The results are used for improving the design, such that
the inflow conditions and the distribution of the temperature are optimized. For
instance, the simulations of the flow through the original geometry do not indicate
self-ignition. Therefore additional wedges have been attached, and it could be shown
that the temperature increases locally such that self-ignition occurs. The underlying
mathematical model consists of the compressible Navier-Stokes equations and
transport equations for the species with reactive source terms. The results have been
obtained with the software package TASCOM3D on the NEC SX-9 with 16 CPUs.
The performance of the code is carefully studied. It turns out that the performance was
not as strong as on the older SX-8 machine. This project was supported by the DFG.

The paper “A Unified TFC (Turbulent Flame-Speed Closure) Combustion Model
for Numerical Computation of Turbulent Gas Flames” by F. Zhang, P. Habisreuther,
M. Hettel and H. Bockhorn concerns the modeling and simulation of turbulent
compressible flows, in particular the modeling of premixed, non-premixed and
partially premixed combustion. The numerical results are obtained by using the
OpenFOAM software, which is based on centered Finite-Volume schemes. Two
different turbulence modeling approaches, RANS and LES, are compared. For the reaction
mechanisms of the species, “state-of-the-art” methods are used, e.g. GRI 3.0 or
Maas/Pope. The details of the chemical reactions are computed using CHEMKIN
II and the results are integrated into the simulations via tables. The results of all
simulations show reasonably good agreement with the experiments. The
computational effort for RANS is much lower than that for LES. For the parallelization an
efficiency of 55 % has been obtained. The main result has been summarized by the
authors as follows: “Nevertheless, there are only very few alternative models until
now which could be used to simulate such different flames using one single reaction
model. There are models indeed, which could provide better results for some of the
demonstrated flames, but these will completely fail elsewhere.” This project was
supported by the DFG.

In “Lagrangian Approach for the Prediction of Slagging and Fouling in Pulverized
Coal Combustion” by O. Lemp, U. Schnell and G. Scheffknecht, the authors
present a modeling approach for the prediction of slagging and fouling in industrial
coal-fired boilers. Both slagging and fouling may cause damage to the furnace
and lead to an immense decline of power plant efficiency. The simulations done so
far (in absence of measurements) are thought to contribute to design, investigation
of damage processes as well as optimization of boiler performance. Simulations
have been performed using the CFD code AIOLOS, based on Finite-Volume
methods. Chemically reacting flow is described by convection-diffusion-reaction
equations. Turbulence is computed via the k-ε model, and for chemistry-turbulence
interactions the Eddy-Dissipation concept is employed. Since realistic (industrial)
scenarios are considered, high-performance computing must be used in order to get
effective hints for a better design, in particular for an optimization of the boiler
performance. The computation is based on six million computational cells, and the
calculations were executed on the NEC-SX8 with 8 CPUs and on the NEC-SX9 with
16 CPUs. The efficiency is lower than for previous computations.


Conservative Implementation of LES-CMC 
for Turbulent Jet Flames 


P. Siwaborworn and A. Kronenburg 


Abstract The objective of the present work is to validate a large eddy simulation
(LES) approach that has been coupled with a conditional moment closure (CMC)
method for the computation of turbulent diffusion flames. Contrary to earlier work,
we use a conservative implementation of CMC that ensures mass conservation of
the fluxes across the computational cell faces. This is equivalent to a weighting of
the fluxes by their probabilities at the cell faces, and it is thought that this weighting
leads to a more dynamic response of the conditionally averaged moments to temporal
changes induced by the large scale turbulent motion. The first application to the
Sandia Flame Series D-F allows for the validation of the method, but further studies
with different flame geometries and more pronounced large scale unsteady effects
will be needed to demonstrate the benefits of conservative CMC when
compared to the conventional (non-conservative) implementation.


1 Introduction 

Large eddy simulation (LES) is considered to be the most promising
approach for the computation of turbulent flows in applications of engineering
interest. LES resolves the large scales of turbulent flows down to the grid size using spatial
filtering and models the subgrid scales, here using the Smagorinsky model. CMC is applied
for turbulent combustion modelling using the mixture fraction as the conditioning variable.
Non-conservative LES-CMC has provided predictions of major and minor species
for different flames over the last decade. However, inaccurate predictions occur in CMC
cells which have large temporal variations of the mixture fraction field. A lack of
FDF-weighting (filtered density function) ratios in the convective term of the
non-conservative CMC is believed to be the main reason for the inaccurate predictions.
In contrast to non-conservative LES-CMC, the present conservative formulation is
inherently mass conserving. It includes FDF-weighting ratios in the convective term
so that improved predictions of local conditional scalars can be obtained.

In this work, investigations of turbulent jet flames (Sandia Flames D, E and F)
are performed with the conservative LES-CMC approach. Flame D is used as the
first test case to validate the numerical results by comparison with well-established
experimental data. Subsequently, Flames E and F are investigated for extinction
and reignition phenomena. Computational results of the present work, obtained using
HLRS resources, are given first. Subsequently, the computational resources which
have been used to simulate the Sandia Flame series will be addressed.


2 Results of the Present Work 

In this section, the computational setup is described, followed by a summary
of the parametric CMC studies in Sect. 2.2. Section 2.3 presents the simulation results
for Sandia Flame D, which is carried out as the first test case in order to validate
the LES-CMC simulation models as a reference. Predictions of Flame D are also
chosen as representative in this paper, since the simulation results of Flames E
and F follow the same tendency as the predictions for Flame D. This section closes
with the discussion and conclusions about the performance of various CMC model
parameters in Sect. 2.4.


2.1 Computational Setup 

The computational grid was generated with dimensions of 80D in the z-direction and 8D in
the x- and y-directions at the flame base, increasing to 60D at the outlet of the domain.
Fine grid simulations of the Sandia Flame series, with 112 x 112 x 320 cells
for LES and 8 x 8 x 80 cells for CMC (reference case), have been carried
out with HLRS resources. The regions above the jet and pilot are captured by 28
and 40 LES cells (in each direction), respectively. According to the grid independence
studies by Navarro et al. [1], these computational cells satisfy the condition that the largest
fraction of the energy spectrum is resolved after the initial break-up of the jet. The CMC
grid has 100 nodes in mixture fraction space, with refinement at η = 0 and 1.


2.2 Parametric Studies 

The results of the LES-CMC modelling are presented in three main parts: parametric
studies of the flow and mixing field, parametric studies of the combustion model and
a parametric study of the CMC grid resolution, carried out in order to study the effects
of each parameter on the simulation results. Details of these studies
are given in the following sections.


2.2.1 Parametric Studies of Flow and Mixing Field 

The parameters studied for the flow and mixing field are the inflow velocity variances in
the turbulent inflow generator, the Schmidt number, Sc, the turbulent Schmidt number,
Sc_t, and the constant C_f, which is used in modelling the variance of the mixture
fraction. Besides C_f, all of them are fluid properties
which actually are not allowed to be changed. However, the velocity variance levels
are adjusted to yield suitable values for the inflow generator, which creates the
oscillation of the velocity field in this work. The Schmidt number and the turbulent
Schmidt number, which are the ratios of momentum transfer rate to mass transfer
rate for resolved and unresolved scales, are varied to test the sensitivity of the flow
and mixing field. Sandia Flame D is used as a reference case, and thus simulation
results obtained with the optimal values are reported in Sect. 2.3.


2.2.2 Parametric Studies of Combustion Model 


The evaluation of the LES-CMC combustion model is the focus of the present
research project. The parametric studies concerning the combustion model carried
out here comprise the evaluation of the CMC formulation, the approximation of
the CMC convective fluxes and the model for the conditionally filtered turbulent
diffusivity.

• CMC Formulations 

The difference between the two CMC formulations (the non-conservative form in
Eq. 1 and the conservative form in Eq. 2) is the inclusion of the FDF information
into the transport equation, in particular into the convective term (the second term
on the LHS) of the conservative CMC formulation.

$$\frac{\partial Q_\alpha}{\partial t} + \widetilde{u_j|\eta}\,\frac{\partial Q_\alpha}{\partial x_j} = \widetilde{\omega_\alpha|\eta} + \widetilde{N_\eta}\,\frac{\partial^2 Q_\alpha}{\partial \eta^2} + \frac{1}{\gamma}\frac{\partial}{\partial x_j}\left(\gamma D_\eta \frac{\partial Q_\alpha}{\partial x_j}\right) \tag{1}$$

$$\frac{\partial (\gamma Q_\alpha)}{\partial t} + \frac{\partial (\gamma\,\widetilde{u_j|\eta}\,Q_\alpha)}{\partial x_j} = Q_\alpha\left[\frac{\partial \gamma}{\partial t} + \frac{\partial (\gamma\,\widetilde{u_j|\eta})}{\partial x_j}\right] + \gamma\,\widetilde{\omega_\alpha|\eta} + \gamma\,\widetilde{N_\eta}\,\frac{\partial^2 Q_\alpha}{\partial \eta^2} + \frac{\partial}{\partial x_j}\left(\gamma D_\eta \frac{\partial Q_\alpha}{\partial x_j}\right) \tag{2}$$

where γ denotes ρ̄P̃(η) and P̃(η) is the Favre filtered probability density
function (FDF). Q_α = Y_α|η is the conditionally filtered mass fraction, ũ_j|η is
the conditionally filtered velocity, ω_α|η is the conditionally filtered reaction source
term and N_η is the conditionally filtered scalar dissipation rate. The term containing D_η
is the subgrid-scale conditional scalar flux, modelled using a gradient diffusion
approximation. Based on the finite volume method, both conditional
species transport equations can be applied to each control volume (CV) of the
computational domain. It is believed that including the FDF information in the
convection term will make the conservative CMC form more precise than the traditional
one [2]. Therefore, the two different formulations of the combustion
model are compared in this work to test this assumption.

• Flux Approximations 

Since each CMC cell comprises a number of LES cells, two methods can be
applied to approximate the convective flux between CMC cells. The first method
calculates the convective flux over the CMC cell face from the convective fluxes
of the LES cells adjacent to the CMC cell face. The second method computes the
convective flux from the values at the CMC cell centre, which takes into account
the values of all LES cells within the CMC cell. The convective flux from the
first method is shown as the summation of the small arrows in Fig. 1, while the
convective flux from the second method is shown as the big arrow in the same
figure.

• Conditionally Filtered Turbulent Diffusivity Models 

There are three methods to model the conditionally filtered turbulent
diffusivity, D_η. All of them are based on the Smagorinsky model for the
subgrid-scale kinematic viscosity, ν_t, and the relation for the unconditionally filtered
turbulent diffusivity, D_t = ν_t/Sc_t. In the first method, D_η is calculated based
entirely on the CMC cell (named D_{η,1}); ν_t in this method is therefore calculated
based on the CMC grid resolution. In the second method, D_{η,2} is calculated based
on D_t from every LES cell located inside that CMC cell. The ensemble
average over a CMC cell can be computed by weighting with the FDF. In the third
method, D_{η,3}, the ratio of the size of the CMC cell to that of the LES cell is included in
the second method in order to adjust the length scale used in modelling D_η.
This additional factor is expected to give more accurate predictions, since the D_η model
should be based on the filter width of CMC instead of the filter width of LES.
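
The two flux approximations and the FDF-weighted averaging described above can be illustrated with a deliberately simplified Python sketch. It is one possible reading of the text, not the implementation used in this work: the function names are hypothetical, density and FDF factors in the flux are omitted, and the LES values are plain lists belonging to a single CMC cell.

```python
def flux_from_face_cells(face_velocity, face_scalar, face_area):
    """flux-1 (sketch): sum the fluxes of the LES cells adjacent to the CMC face."""
    return sum(u * q * a for u, q, a in zip(face_velocity, face_scalar, face_area))

def flux_from_cell_centre(cell_velocity, cell_scalar, total_face_area):
    """flux-2 (sketch): use values averaged over all LES cells in the CMC cell."""
    u_mean = sum(cell_velocity) / len(cell_velocity)
    q_mean = sum(cell_scalar) / len(cell_scalar)
    return u_mean * q_mean * total_face_area

def turbulent_diffusivity(nu_t, sc_t):
    """Unconditional turbulent diffusivity D_t = nu_t / Sc_t."""
    return nu_t / sc_t

def fdf_weighted_average(values, fdf_weights):
    """Ensemble average over the LES cells of one CMC cell, weighted by the FDF."""
    total = sum(fdf_weights)
    return sum(v * w for v, w in zip(values, fdf_weights)) / total

# Hypothetical LES data for one CMC cell: sub-grid viscosities and FDF weights.
nu_t_cells = [1.0e-4, 2.0e-4, 4.0e-4]
fdf_cells = [0.5, 1.0, 0.1]
d_t_cells = [turbulent_diffusivity(nu, 0.4) for nu in nu_t_cells]
d_eta_2 = fdf_weighted_average(d_t_cells, fdf_cells)  # analogue of the second method
```

For a uniform field both flux approximations coincide; they differ when velocity and scalar vary within the CMC cell, because the second method averages over the whole cell before forming the product.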


2.2.3 Parametric Study of the CMC Grid Resolution 

Another main parametric study for all flames of the Sandia Flame series is the CMC grid
resolution. For a simple and stable flame, this parameter might not reveal any effect.
However, it is believed that a high number of CMC cells may capture the extinction
and reignition phenomena due to the turbulence-chemistry interactions in Flames E
and F. Thus, three CMC grid resolutions are tested in this study. These
are 4 x 4 x 80, 8 x 8 x 80 and 16 x 16 x 80 CMC cells for the same LES resolution.
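
To give a feeling for these resolutions, the following short Python sketch counts how many LES cells fall into one CMC cell for the 112 x 112 x 320 LES grid. It assumes, purely for illustration, that LES cells are distributed evenly over the CMC cells in index space; the function name is hypothetical.

```python
LES_GRID = (112, 112, 320)
CMC_GRIDS = [(4, 4, 80), (8, 8, 80), (16, 16, 80)]

def les_cells_per_cmc_cell(les_grid, cmc_grid):
    """Number of LES cells per CMC cell, assuming an even split in each direction."""
    count = 1
    for n_les, n_cmc in zip(les_grid, cmc_grid):
        if n_les % n_cmc != 0:
            raise ValueError("LES grid is not an integer multiple of the CMC grid")
        count *= n_les // n_cmc
    return count

for cmc_grid in CMC_GRIDS:
    print(cmc_grid, les_cells_per_cmc_cell(LES_GRID, cmc_grid))
```

Under this assumption each refinement step in the x- and y-directions reduces the number of LES cells averaged into one CMC cell by a factor of four.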

To summarize, all parametric studies are shown in Table 1, and they were varied
for Flame D. The values resulting in the best agreement between simulation and
experiments in the parametric studies of the flow and mixing field of Flame D were chosen
for further simulations of Flames E and F.

Fig. 1 A schematic of the two approximations of the CMC convective flux




2.3 Results of Sandia Flame D 

Results of Sandia Flame D are composed of three principal parts. Firstly, results of 
the parametric studies of flow and mixing field are reported. Subsequently, results 
of the parametric studies of combustion model are shown and discussed. Finally, 
results of the parametric study of CMC grid resolution are given and discussed. 


2.3.1 Parametric Studies of the Flow and Mixing Field 

Using the reference case of the parametric studies of the combustion model (CMC-1, flux-1
and D_{η,2}) and a CMC grid resolution of 8 x 8 x 80, the best results of the parametric
studies of the flow and mixing field are obtained with variance-2, Sc_2, Sc_{t,1} and C_{f,1}.
The meaning of each parameter can be found in Table 1. Overviews of the best results for the
flow and mixing field, obtained with these optimum values, can be seen in Figs. 2 and 3.

Figure 2 shows a snapshot of the instantaneous temperature field along a 2D
plane through the burner centerline for the whole computational domain (left)
and focused on the upstream region (right). The black lines identify the isoline
of stoichiometric mixture fraction. It can be observed from the temperature field
(Fig. 2 (right)) that there is a high level of turbulence which comes from the digital
turbulent inflow generator at the inlet. Moreover, local extinction, which would
be characterized by a discontinuous red color of the temperature along the isoline of
stoichiometric mixture fraction, hardly occurs in Flame D, in accordance with the
experimental findings.

Radial profiles of mean and RMS axial velocity and mixture fraction at three
downstream locations are shown in Fig. 3. Both mean and RMS of axial velocity and
mixture fraction agree well with the experiments [3,4], since the effects of the initial
inflow from the inflow generator were checked and adjusted beforehand. Moreover, it can
be seen from Fig. 3 that the jet spreading is captured well. Small overpredictions
of the mean mixture fraction in the range 1.2 < r/D < 2 at position z/D = 3
may come from the influence of the lateral boundary conditions. Small overpredictions
of the mean axial velocity and mixture fraction around 1 < r/D < 2 at position
z/D = 15 may require more simulation time for more precision. However, the
current predictions provide a good basis for the parametric studies of the combustion
model.

Table 1 Summary of parametric studies

Quantity                        Name           Values or methods
Variances of inflow generator   variance-1     u'u', v'v' and w'w' [3]
                                variance-2     scaled u'u', v'v' and w'w'
Schmidt number                  Sc_1           0.4
                                Sc_2           0.7
                                Sc_3           1.0
Turbulent Schmidt number        Sc_{t,1}       0.4
                                Sc_{t,2}       0.7
Variance of mixture fraction    C_{f,1}        0.2
                                C_{f,2}        0.3
CMC formulation                 CMC-1          Conservative CMC
                                CMC-2          Non-conservative CMC
Convective flux                 flux-1         Computing fluxes based on LES cells at CMC faces
                                flux-2         Computing fluxes based on CMC cell centers
Conditionally filtered          D_{η,1}        Modelling D_η based on CMC cells
turbulent diffusivity           D_{η,2}        Modelling D_η based on LES cells
                                D_{η,3}        Modelling D_η with adjusting the length scale
Number of CMC cells             4 x 4 x 80     4 CMC cells in X- and Y-directions with 80 CMC cells in Z-direction
                                8 x 8 x 80     8 CMC cells in X- and Y-directions with 80 CMC cells in Z-direction
                                16 x 16 x 80   16 CMC cells in X- and Y-directions with 80 CMC cells in Z-direction







Fig. 2 Snapshots of the temperature field in total computational domain (left) and in the upstream 
region (right) for Sandia Flame D. The iso-contour of stoichiometric mixture fraction is presented 
by the black lines 

2.3.2 Parametric Studies of the Combustion Model 

As shown in Table 2, five case studies are performed to show the effect of each
variation of the CMC model. All cases are based on the optimal conditions from the parametric
studies of the flow and mixing field (variance-2, Sc_2, Sc_{t,1} and C_{f,1}) and use 8 x 8 x 80
for the number of CMC cells. The reference case (case-1) includes the models
CMC-1, flux-1 and D_{η,2}, while the other cases have at least one varied parameter
compared with the reference case.


Preliminary Studies 

Preliminary studies for the parametric studies of the combustion model are required to
identify the cases whose predictions may differ from those of the
reference case (case-1 in Table 2). These cases will be examined further
in the next sections. In the first step, a representative direction
has to be defined. Since it has the highest convective value among the three directions,
the convective flux in the z-direction is chosen as representative. Note that the
FDF profile has to be taken into account, since it is applied to transfer the values
from mixture fraction space to physical space. A low value of the FDF means that
only a small influence of any property can appear in physical space.
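
The role of the FDF in transferring conditional values to physical space can be sketched as an integral of the conditional profile weighted by the FDF over mixture fraction space. The Python sketch below assumes a beta-shaped FDF parametrized by the filtered mixture fraction and its variance, which is a common presumed shape but is not stated in this section; the function names and the numerical values are hypothetical.

```python
import math

def beta_fdf(eta, mean, variance):
    """Presumed beta-shaped FDF (an assumption) for given mean and variance."""
    g = mean * (1.0 - mean) / variance - 1.0
    a, b = mean * g, (1.0 - mean) * g
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(eta)
                    + (b - 1.0) * math.log(1.0 - eta))

def unconditional_value(conditional_profile, mean, variance, n=2000):
    """Midpoint-rule integral of Q(eta) * P(eta) over 0 < eta < 1."""
    d_eta = 1.0 / n
    total = 0.0
    for i in range(n):
        eta = (i + 0.5) * d_eta
        total += conditional_profile(eta) * beta_fdf(eta, mean, variance) * d_eta
    return total

# Hypothetical local state: filtered mixture fraction 0.3 with variance 0.01.
filtered_one = unconditional_value(lambda eta: 1.0, 0.3, 0.01)
filtered_eta = unconditional_value(lambda eta: eta, 0.3, 0.01)
```

With these values the FDF is concentrated around η = 0.3, so conditional values far from that mixture fraction contribute very little to the filtered quantity, which is the point made in the paragraph above.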

Subsequently, the radial and axial distributions of the convective fluxes in the
z-direction are shown. The instantaneous CH4 fluxes in Fig. 4 show that the
differences between the cases are larger in the radial distribution than in the
axial distribution. Thus, the radial distribution is used to compare the
convective fluxes of the other species. As discussed above, the FDF must be taken
into account in this comparison.



Fig. 3 Radial profiles of mean and RMS axial velocity and mixture fraction at three downstream 
locations for Flame D. Symbols denote experimental values [3,4], while the solid and dashed lines 
present the mean and RMS values of LES-CMC (reference case of Table 2) 


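Table 2 Summary of different parameters in combustion model study. The meaning of each
numerical method can be found in Table 1

Name                      Combustion model              Flow and mixing field                      CMC grid resolution
case-1 (reference case)   CMC-1, flux-1, $D_{\eta,2}$   variance-2, $Sc_2$, $Sc_{t,1}$, $C_{f,1}$  8 x 8 x 80
case-2                    CMC-2, flux-1, $D_{\eta,2}$   variance-2, $Sc_2$, $Sc_{t,1}$, $C_{f,1}$  8 x 8 x 80
case-3                    CMC-1, flux-2, $D_{\eta,2}$   variance-2, $Sc_2$, $Sc_{t,1}$, $C_{f,1}$  8 x 8 x 80
case-4                    CMC-1, flux-2, $D_{\eta,1}$   variance-2, $Sc_2$, $Sc_{t,1}$, $C_{f,1}$  8 x 8 x 80
case-5                    CMC-1, flux-1, $D_{\eta,3}$   variance-2, $Sc_2$, $Sc_{t,1}$, $C_{f,1}$  8 x 8 x 80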

Fig. 4 Axial (left) distribution along the centerline and radial (right) distribution at z/D = 7.5 of
convective fluxes in z-direction of CH4 for a time step (Sandia Flame D)


The fluxes of the other species follow the same tendency as the CH4 fluxes in
Fig. 4. Case-2, case-3 and case-4 produce convective fluxes that differ from those
of the reference case (case-1). The fluxes of case-1 and case-5 are similar, as
are those of case-3 and case-4, because the different $D_\eta$ models have little
effect on the convective term. The conditional scalar predictions of case-3 and
case-4 should therefore not differ, and case-3 is chosen for the further
parametric studies. Since case-1, case-2 and case-3 produce different convective
fluxes, their conditional scalar predictions should differ from each other.

In the next step, statistics of three cases (case-1, case-2 and case-3) are
sampled over 30,000 time steps to investigate the effects of the combustion model
parameters and to validate the results against the experiment.
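
The sampling of mean and RMS values over many time steps can be illustrated with a running accumulation. The source does not state how the statistics are gathered; the following is only a minimal sketch that assumes Welford's online method, and the class name `RunningStats` and the sinusoidal test signal are hypothetical.

```python
import math


class RunningStats:
    """Accumulate mean and RMS fluctuation of a sampled quantity (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def rms(self):
        # RMS of the fluctuation about the mean (population form)
        return math.sqrt(self._m2 / self.n) if self.n else 0.0


stats = RunningStats()
for step in range(30000):
    # hypothetical signal standing in for one probe value per LES time step
    stats.add(1.0 + 0.1 * math.sin(0.01 * step))
```

A single pass of this kind avoids storing all 30,000 samples, which matters when the statistics are collected in every cell.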

Fig. 5 Conditional profiles of cross-sectionally averaged temperature and CO at two different
downstream positions in mixture fraction space for Flame D. Symbols are experimental data [4],
while the solid, dashed and dotted lines present the results of LES-CMC in different cases in
combustion model (Table 2)


Conditionally Filtered Reactive Scalars 

Since the CMC methodology analyzes species mass fractions in mixture fraction
space, the performance of the combustion model can be decoupled from the flow
field predictions. The performance of each combustion model case can therefore
be assessed directly from the conditional profiles. Initial values for the
conditional reactive species are obtained from the SLFM solution.

Figs. 5 and 6 show good agreement between the case studies and the experiments.
The conditional mean temperature and the conditional mean mass fractions of CO,
CH4 and H2 are given in the two figures as representatives of intermediate
products, fuel and radicals. Note that the error bars indicate the conditional
RMS and are plotted only to illustrate the turbulence level of each scalar. The
different predictions of case-1 and case-2 on the rich side ($\eta$ > 0.35) result
from the two different sets of convective fluxes calculated from the two CMC
formulations. Because the convective term of case-2 lacks the FDF weighting, low
convective fluxes are generated on the rich side at the upstream positions, as
can be observed in Fig. 7. Consequently, case-2 (non-conservative CMC) usually
overpredicts temperature, radicals and intermediate products on the rich side of
mixture fraction, while it underpredicts the fuel.

Fig. 6 Conditional profiles of cross-sectionally averaged CH4 and H2 at two different downstream
positions in mixture fraction space for Flame D. Symbols are experimental data [4], while the solid,
dashed and dotted lines present the results of LES-CMC in different cases in combustion model
(Table 2)



Fig. 7 Radial distribution of mean convective fluxes in z-direction of CH 4 at z/D = 7.5 (Sandia 
Flame D). The solid, dashed and dotted lines present the results of LES-CMC in different cases 
and FDF (reference case) in combustion model (Table 2) 


Figs. 5 and 6 show that case-1 (conservative CMC) gives more accurate results
than case-2. The cross-sectional results of case-3 hardly differ from those of
case-1, even though the convective fluxes of the two cases differ (Fig. 7). The
reason is that the cross-sectional averages are calculated using the FDF value of
each CMC cell in the same cross section. If the conditional predictions of two
cases differ in a CMC cell only at mixture fractions where the FDF is low, the
cross-sectionally averaged conditional predictions will be similar.
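
The role of the FDF as a weighting function can be illustrated numerically. The sketch below assumes a beta-shaped FDF and two hypothetical conditional temperature profiles that differ only on the rich side; neither the profile shapes nor the numerical values are taken from the simulations. It integrates the conditional profile against the FDF to obtain the physical-space mean.

```python
import math


def beta_fdf(eta, mean, var):
    """Presumed beta-shaped FDF of mixture fraction (an assumption of this sketch)."""
    g = mean * (1.0 - mean) / var - 1.0
    a, b = mean * g, (1.0 - mean) * g
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(eta)
                    + (b - 1.0) * math.log(1.0 - eta))


def unconditional_mean(cond_profile, mean, var, n=2000):
    """Midpoint-rule integral of Q(eta) * P(eta) over 0 < eta < 1."""
    d_eta = 1.0 / n
    total = 0.0
    for k in range(n):
        eta = (k + 0.5) * d_eta
        total += cond_profile(eta) * beta_fdf(eta, mean, var) * d_eta
    return total


def profile_a(eta):
    # hypothetical conditional temperature (K), peaking near eta = 0.35
    return 300.0 + 1700.0 * math.exp(-((eta - 0.35) / 0.2) ** 2)


def profile_b(eta):
    # same profile, but 200 K hotter on the rich side (eta > 0.6)
    return profile_a(eta) + (200.0 if eta > 0.6 else 0.0)


# lean cell: FDF centred at eta = 0.2 with small variance, so it is
# practically zero where the two profiles differ
mean_a = unconditional_mean(profile_a, mean=0.2, var=0.005)
mean_b = unconditional_mean(profile_b, mean=0.2, var=0.005)
```

Although the two conditional profiles differ by 200 K for eta > 0.6, the FDF-weighted means in this lean cell are practically identical, because the FDF carries almost no weight there.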

Table 3 Summary of different cases in CMC grid resolution study

Name                     CMC grid resolution   Flow and mixing fields                     Combustion model
res-1                    4 x 4 x 80            variance-2, $Sc_2$, $Sc_{t,1}$, $C_{f,1}$  CMC-1, flux-1, $D_{\eta,2}$
res-2 (reference case)   8 x 8 x 80            variance-2, $Sc_2$, $Sc_{t,1}$, $C_{f,1}$  CMC-1, flux-1, $D_{\eta,2}$
res-3                    16 x 16 x 80          variance-2, $Sc_2$, $Sc_{t,1}$, $C_{f,1}$  CMC-1, flux-1, $D_{\eta,2}$


2.3.3 Parametric Study of CMC Grid Resolution 

As described in Table 1, three CMC grid resolutions are compared, while the same
settings are used for the flow and mixing field (variance-2, $Sc_2$, $Sc_{t,1}$
and $C_{f,1}$) and the CMC combustion model (CMC-1, flux-1 and $D_{\eta,2}$). The
number of CMC cells in each of the x- and y-directions is 4 for res-1, 8 for res-2
(reference case) and 16 for res-3, with 80 CMC cells in the z-direction in all
cases, as summarized in Table 3.

Since the CMC resolution varies in the radial direction among the three cases,
the radial distributions of the conditional values should show the most
prominent differences. Therefore, the radial distributions of the mean scalars
are investigated at z/D = 3, 7.5 and 15. The radial distributions of the mean
temperature and CO predictions are shown in Fig. 8.

At z/D = 3, res-2 and res-3 perform better than res-1, which underpredicts the
temperature at this position. Res-3 captures the highest value of CO at
z/D = 7.5. Moreover, the predictions of res-3 (16 x 16 x 80 CMC cells) match the
experiments better than the others at z/D = 15, which shows the advantage of
small CMC cells. A likely reason is that large CMC cells predict inaccurately
where the mixture fraction gradient is high. However, increasing the CMC
resolution from res-2 to res-3 increases the computational time by more than
60 %. Considering both the computational time and the results of all CMC
resolutions, the appropriate resolution for Flame D is res-2 (8 x 8 x 80 CMC
cells).


2.4 Summary 

In this section, parametric studies of LES-CMC are carried out for the Sandia 
Flame D. The parameter studies comprised investigations of flow and mixing field, 
variants of the CMC combustion model parameters and CMC grid resolution. 

Flame D is used to investigate the influence of various flow and mixing field
parameters on the simulation results. The optimal values, Sc = 0.7, Sc_t = 0.4
and Ct = 0.2, are used for further studies of Flame D, as well as for Flames E
and F. The velocity variances of the inflow generator, however, depend on the
physical inflow of each flame. It should be noted that the velocity variances are
adjusted to reduce the high level of turbulence that may arise from the
implementation of the inflow generator. Suitable inflow variances for Flame D are
found to be | of the measured variances at z/D = 0.14 of these flames,
respectively.

Fig. 8 Radial profiles of mean temperature and CO for Flame D. Symbols are experimental data
[4], while the solid, dashed and dotted lines present the results of LES-CMC from the different
CMC grid resolutions (Table 3)

Parameter studies of different variants of the CMC combustion model are carried
out to find the most suitable model. Initial studies of the CMC fluxes have shown
that the effect of turbulent diffusivity modelling is negligible. However, a
comparison of the CMC formulations (conservative vs. non-conservative) reveals
considerable differences. Moreover, some slight differences between the two
methods of CMC convective flux approximation (cell face vs. cell centre based)
are detected. Therefore, three dominant cases which differ in both numerical
aspects are investigated further. Conditional mean scalars, averaged over the
same cross section, show that the conservative CMC formulation with convective
fluxes computed from the LES cells located at the CMC cell faces (case-1) gives
results similar to the one with convective fluxes computed at the CMC cell
centres (case-3). This can be explained by the low FDF values at the mixture
fractions where the predictions differ within a CMC cell. Consequently, the
FDF-weighted conditional averages yield similar results over a cross section. In
general, the conditional predictions of case-1 match the mean measurements better
than those of case-2 (the non-conservative formulation using the same flux
approximation). This is because the variation of FDF-weighted convective fluxes
in different directions of case-1 allows the predictions to be more accurate.

P. Siwaborwom and A. Kronenburg

Table 4 Test cases for the LES-CMC model

Test cases              LES cells         CMC cells                     Flux implementation
Bluff-body flame, HM1   218 x 218 x 320   16 x 16 x 80, 24 x 24 x 80    Various
4 lifted flames         96 x 96 x 480     16 x 16 x 80, 16 x 16 x 160   Various

Three different CMC grid resolutions (4 x 4 x 80, 8 x 8 x 80 and 16 x 16 x 80)
are examined in order to find the appropriate number of CMC cells for Sandia
Flame D. The best predictions are obtained with a CMC grid resolution of
16 x 16 x 80. However, 8 x 8 x 80 CMC cells is the reasonable resolution for
Flame D, since it performs well at a lower computational cost.


3 Future Study Cases 

To demonstrate the advantages of the conservative CMC formulation, simulations
of more complicated flames are required. Future test cases will therefore include
the Sydney bluff-body flame HM1, with 218 x 218 x 300 LES cells, for two CMC mesh
sensitivity studies. Moreover, four lifted flames investigated at Berkeley [5,6]
and Calgary [7] will be examined (two Berkeley flames and two Calgary flames). A
summary of all test cases can be found in Table 4.


4 The Usage of Computational Resources 

The fine grid simulations presented here have been performed on the NEC Nehalem
Cluster using 80 processors, a number chosen on the basis of scalability tests
for LES-CMC. Parallelization with MPI and compiler vectorization make the code
run faster. The wall time is approximately 48 h per submitted job. The analysis
of all parametric studies from Table 1 for Flames D, E and F corresponds to
450,000 CPU hours. A similar amount, 450,000 CPU hours, is estimated as the
computational requirement for all future test cases.


Numerical Investigation of a Complete Scramjet
Demonstrator Model for Experimental Testing
Under Flight Conditions


Yann Simsont, Peter Gerlinger, and Manfred Aigner 


Abstract In the present paper a complete scramjet demonstrator model for
experimental testing at Mach 8 is investigated numerically using the scientific
code TASCOM3D (Turbulent All Speed Combustion Multigrid Solver) on the HPC
vector system NEC SX-9, installed at the High Performance Computing Center
Stuttgart (HLRS). First the three-dimensional intake of the model is simulated.
Then the results are used as inlet conditions for the simulation of the combustor,
where hydrogen is injected by a lobed strut injector located in the middle of the
diverging chamber. The test conditions, corresponding to a flight speed of Mach 8
at an altitude of 30 km, and the results of the simulations are discussed in detail,
and the difficulties that occur are highlighted. To ensure self-ignition and
prevent blockage of the flow, potential design changes are described and
investigated in order to prove their functionality. Finally the performance of the
reactive and non-reactive simulations on the NEC SX-9 is analyzed.


1 Introduction 

For future hypersonic transportation the use of air breathing engines (i.e. ramjets for
flight Mach numbers 2-7 and scramjets for flight Mach numbers 5-15) is of great
interest. In contrast to rocket driven systems, no oxygen has to be carried on
board. Air breathing engines therefore offer the opportunity to increase the
payload to total mass ratio and thus reduce the cost per payload unit. The main
objective of the Research Training Group GRK 1095/2 (University of Stuttgart,
Technical University of Munich, RWTH Aachen University and DLR Cologne) is the
design and development of a complete scramjet demonstrator model. Experimental
testing of the model is funded by the DFG grant GA 1332/1-1 and conducted in two
hypersonic wind tunnels (IT-302 and AT-303) at the Khristianovich Institute of
Theoretical and Applied Mechanics (ITAM), Russian Academy of Sciences, Siberian
Branch in Novosibirsk, Russia. A first testing period was completed in October
and November 2011, and a second in March and April 2012. Figure 1 shows the
complete scramjet demonstrator model mounted in the IT-302 wind tunnel. The
present paper describes the final numerical simulations before experimental
testing. These simulations aim to identify unfavorable flow conditions that might
occur during testing (e.g. blockage of the intake, deficient self-ignition in the
combustor, thermal blockage, etc.), as well as suitable corrective measures to
avoid those phenomena and ensure successful experiments. Given the advanced stage
of the project (the scramjet demonstrator model had already been assembled),
potential design modifications derived from the numerical results have to be easy
to implement.

Fig. 1 Scramjet demonstrator model in the hypersonic test facility IT-302

Y. Simsont (✉) · P. Gerlinger · M. Aigner
Institut für Verbrennungstechnik der Luft- und Raumfahrt, Universität Stuttgart, Pfaffenwaldring
38-40, 70569 Stuttgart, Germany
e-mail: yann.simsont@dlr.de

W.E. Nagel et al. (eds.), High Performance Computing in Science and Engineering ’12,
DOI 10.1007/978-3-642-33374-3_15, © Springer-Verlag Berlin Heidelberg 2013


2 Geometry and Test Design 

The investigated scramjet demonstrator model consists of an intake, an isolator, a
combustor with central strut injection and a nozzle (Fig. 2) and has a total length
of 1.046 m. The intake combines a single outer compression ramp (15.5° angle)
and side wall compression (3.5° angle) and has been developed based on a 3D
mixed compression intake tested at ITAM [1] and various numerical simulations
[2]. Gaseous hydrogen is injected at x = 623 mm downstream of the ramp leading
edge in axial flow direction at the trailing edge of a lobed strut injector through seven
horizontal and six vertical ports shown in Fig. 3. The strut is mounted centrally in
the model from one sidewall to the other and corresponds to previously investigated
lobed strut injectors [3,4]. The lobed structure creates streamwise vortices and
therefore enhances the mixing of fuel and air. The combustion chamber has a
rectangular cross section; the top and bottom walls diverge, each with an angle
of 2°, beginning at x = 580 mm (at half length of the strut).

Fig. 2 Drawing of longitudinal section of the scramjet demonstrator model with instrumentation

Fig. 3 Sketch of central lobed strut injector (top) and areas of hydrogen injection at the trailing
edge of the injector highlighted in blue (bottom)


Table 1 Inflow conditions for experimental testing at Mach 8

Air conditions                        A       B
Mach number $Ma_\infty$ (-)           8       8
Velocity $u_\infty$ (m/s)             2,472   2,201
Total pressure $p_0$ (bar)            110     110
Static pressure $p_\infty$ (Pa)       1,127   1,127
Total temperature $T_0$ (K)           3,280   2,600
Static temperature $T_\infty$ (K)     238     188
Wall temperature $T_{wall}$ (K)       293     293


The intake ramp is equipped with 15 thermocouples and two static pressure
transducers along the center line. Over 30 pressure transducers are integrated in
the top and bottom walls of the combustor, and a Pitot rake is mounted at the end
of the combustor to measure the Mach number at 5 points of the cross section. The
test conditions, corresponding to a flight speed of Mach 8 at an altitude of 30 km,
are summarized in column A of Table 1. Column B lists the conditions for a lower
total temperature $T_0$, which was also tested experimentally. Since the
measurement duration in the wind tunnels is relatively short (about 100 ms), the
wall temperature of the model is assumed to be constant. To exploit the symmetry
of the investigated scramjet model and save computation time, only one half of
the model is simulated, using approximately 24 million structured grid cells
(14.4 million for the intake and 9.6 million for the combustor).
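
The entries of Table 1 can be cross-checked against each other with the isentropic relations for a calorically perfect gas. This is only a rough sketch: the values of the ratio of specific heats and the gas constant are assumptions, and at total temperatures around 3,000 K real-gas effects are not negligible, so only approximate agreement should be expected.

```python
import math

GAMMA = 1.4      # ratio of specific heats, calorically perfect air (assumption)
R_AIR = 287.05   # specific gas constant of air in J/(kg K) (assumption)


def freestream(mach, t_static, p_static):
    """Return velocity (m/s), total temperature (K) and total pressure (bar)."""
    speed_of_sound = math.sqrt(GAMMA * R_AIR * t_static)
    velocity = mach * speed_of_sound
    ratio = 1.0 + 0.5 * (GAMMA - 1.0) * mach ** 2   # T0 / T
    t_total = t_static * ratio
    p_total = p_static * ratio ** (GAMMA / (GAMMA - 1.0)) / 1.0e5
    return velocity, t_total, p_total


u_a, t0_a, p0_a = freestream(8.0, 238.0, 1127.0)  # column A of Table 1
u_b, t0_b, p0_b = freestream(8.0, 188.0, 1127.0)  # column B of Table 1
```

Under these assumptions the computed velocities, total temperatures and total pressure agree with the tabulated values to within a fraction of a percent.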




































































3 Governing Equations and Numerical Scheme 


The investigations presented in this paper are performed using the scientific 
code TASCOM3D. The code has been used successfully in the last two decades 
simulating reacting and non-reacting flows. It describes reacting flows by solving 
the full compressible Navier-Stokes, species and turbulence transport equations. 
Additionally an assumed PDF (probability density function) approach is used to 
take turbulence chemistry interaction into account. The set of averaged equations in 
three-dimensional conservative form is given by 


$$\frac{\partial Q}{\partial t} + \frac{\partial (F - F_v)}{\partial x} + \frac{\partial (G - G_v)}{\partial y} + \frac{\partial (H - H_v)}{\partial z} = S , \tag{1}$$

where

$$Q = \left[ \bar{\rho}, \bar{\rho}\tilde{u}, \bar{\rho}\tilde{v}, \bar{\rho}\tilde{w}, \bar{\rho}\tilde{E}, \bar{\rho}q, \bar{\rho}\omega, \bar{\rho}\sigma_T, \bar{\rho}\sigma_Y, \bar{\rho}\tilde{Y}_i \right]^T , \quad i = 1, 2, \ldots, N_k - 1 . \tag{2}$$

The variables in the conservative variable vector Q are the density $\bar{\rho}$
(averaged), the velocity components (Favre averaged) $\tilde{u}$, $\tilde{v}$ and
$\tilde{w}$, the total specific energy $\tilde{E}$, the turbulence variables
$q = \sqrt{k}$ and $\omega = \epsilon / k$ (where $k$ is the kinetic energy and
$\epsilon$ the dissipation rate of $k$), the variance of the temperature
$\sigma_T$ and the variance of the sum of the species mass fractions $\sigma_Y$
and finally the species mass fractions $\tilde{Y}_i$ ($i = 1, 2, \ldots, N_k - 1$).
Thereby $N_k$ describes the total number of species that are used for the
description of the gas composition. The vectors F, G and H specify the inviscid
fluxes in x-, y- and z-direction, and $F_v$, $G_v$ and $H_v$ the viscous fluxes,
respectively. The source vector S in Eq. (1) includes terms from turbulence and
chemistry and is given by

$$S = \left[ 0, 0, 0, 0, 0, S_q, S_\omega, S_{\sigma_T}, S_{\sigma_Y}, S_{Y_i} \right]^T , \quad i = 1, 2, \ldots, N_k - 1 , \tag{3}$$

where $S_q$ and $S_\omega$ are the averaged source terms of the turbulence
variables, $S_{\sigma_T}$ and $S_{\sigma_Y}$ the source terms of the variance
variables ($\sigma_T$ and $\sigma_Y$) and $S_{Y_i}$ the source terms of the
species mass fractions. For turbulence closure a two-equation low-Reynolds-number
q-ω turbulence model is applied [5]. The momentary chemical production rate of
species i in Eq. (3) is given by


$$S_{Y_i} = M_i \sum_{r=1}^{N_r} \left( \nu''_{ir} - \nu'_{ir} \right) \left[ k_{f,r} \prod_{l=1}^{N_k} c_l^{\nu'_{lr}} - k_{b,r} \prod_{l=1}^{N_k} c_l^{\nu''_{lr}} \right] , \tag{4}$$

where $k_{f,r}$ and $k_{b,r}$ are the forward and backward rate constants of
reaction r (defined by the Arrhenius function), $M_i$ is the molecular weight of
species i, $c_i = \rho Y_i / M_i$ the species concentration, and $\nu'_{ir}$ and
$\nu''_{ir}$ are the stoichiometric coefficients of species i in reaction r. The
averaged chemical production rate of a species i resulting from the use of an
assumed PDF approach is described in detail in Refs. [6,7]. In the present paper
the reactive simulations have been performed using a modified Jachimowski
hydrogen/air reaction mechanism with 9 species and 19 steps [8,9]. The unsteady
set of differential equations in Eq. (1) is solved using an implicit lower-upper
symmetric Gauss-Seidel (LU-SGS) [8, 10-12] finite-volume algorithm, where the
finite-rate chemistry is treated fully coupled with the fluid motion. More
details concerning TASCOM3D may be found in Refs. [7, 8, 12-14].
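
The momentary production rate defined by forward and backward rate constants, species concentrations and stoichiometric coefficients follows the law of mass action. The following is a minimal sketch of that evaluation, not the TASCOM3D implementation: the function name `production_rates`, the dictionary layout and the single global reaction with its rate constant are hypothetical and serve only to illustrate the bookkeeping.

```python
def production_rates(rho, mass_fractions, molar_masses, reactions):
    """Chemical production rate of every species by the law of mass action.

    reactions: list of dicts with rate constants 'kf' and 'kb' and the
    stoichiometric coefficient lists 'nu_f' (reactant side) and 'nu_b'
    (product side), one entry per species.
    """
    conc = [rho * y / m for y, m in zip(mass_fractions, molar_masses)]
    rates = [0.0] * len(conc)
    for rxn in reactions:
        forward = rxn["kf"]
        backward = rxn["kb"]
        for c, nf, nb in zip(conc, rxn["nu_f"], rxn["nu_b"]):
            forward *= c ** nf
            backward *= c ** nb
        net = forward - backward
        for i, (nf, nb) in enumerate(zip(rxn["nu_f"], rxn["nu_b"])):
            rates[i] += molar_masses[i] * (nb - nf) * net
    return rates


# hypothetical single reaction H2 + O2 -> 2 OH with an arbitrary rate constant
molar_masses = [2.016, 31.998, 17.007]   # H2, O2, OH
reaction = {"kf": 1.0e3, "kb": 0.0, "nu_f": [1, 1, 0], "nu_b": [0, 0, 2]}
rates = production_rates(0.2, [0.02, 0.23, 0.0], molar_masses, [reaction])
```

Because the reaction is mass-balanced, the production rates of all species sum to zero, which is a convenient sanity check for any mechanism table.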


4 Investigation of the 3D Intake 

In this section the three-dimensional intake and the subsequent isolator are
investigated using the experimental test conditions listed in column A of
Table 1. The main focus is on generating flow conditions that enable
self-ignition in the combustor while preventing any blockage. Figure 4 shows the
Mach number distribution in the central plane of the model. The incoming air is
compressed by two strong shock waves: the first is the leading ramp shock, which
narrowly misses the cowl's lip; the second originates at the cowl and impinges on
the thick boundary layer on the bottom wall of the model close to the leading
edge of the central strut injector. The interaction of the cowl shock and the
boundary layer induces a separation zone with low Mach numbers, shown in detail
in the lower part of Fig. 4. Although the extent of the separation in the
spanwise cross section is relatively small, the flow beneath the central strut
injector is perturbed. In order to avoid the impingement of the cowl shock on the
boundary layer, a passive suction slot is added to the intake model at
x = 525 mm.

To determine the size of the opening, two different designs are investigated
numerically. Suction I is 15 mm long and has an angle of 45° at the leading edge
and 20° at the trailing edge; suction II is 25 mm long and has the same angle of
45° at the leading edge but an angle of 37° at the trailing edge. The Mach
number distribution in the central plane upstream of the strut is given for both
slot configurations in Fig. 5. As intended, suction I as well as suction II
capture the cowl shock. Consequently the separation zone is successfully
eliminated by the suction. Table 2 lists the air mass flows for the intake and
the two suction slot configurations. The loss of compressed air exiting through
the suction slot is minor for both slot designs, with only 1.6 % (suction I) and
4.6 % (suction II) of the total captured mass flow. Because of its greater
length, suction II is considered to be more tolerant towards any unsteadiness or
modification of the simulated test conditions and is therefore recommended for
the experimental testing. The conditions at the interface between isolator and
combustor (cross section at half length of the central strut injector, where
x = 580 mm), shown in Fig. 6, represent a strongly three-dimensional and
non-uniform flow field due to the 3D intake. The average levels of pressure
($p_{av}$ = 0.522 bar) and temperature ($T_{av}$ = 1,090 K) are relatively low,
while the Mach number is high ($Ma_{av}$ = 3.02). With respect to the desired
self-ignition, the conditions are particularly poor in the region directly
beneath the strut, where the temperature is about 800 K or less. These
unfavorable conditions are intensified downstream of the interface, where the
divergent geometry of the combustor further accelerates the flow and lowers the
temperature.

Fig. 4 Mach number distribution in the central plane of the intake (top) and detail of the impinging
cowl shock at the bottom wall causing a separation zone (bottom)

Fig. 5 Mach number distribution in the central plane for suction I (left) and suction II (right)

Table 2 Calculated air mass flows for intake simulations with passive suction

                                                  Total (g/s)   Relative (%)
Mass flow intake $\dot{m}_{intake}$               636           100.0
Mass flow suction I $\dot{m}_{suction\,I}$        10            1.6
Mass flow suction II $\dot{m}_{suction\,II}$      29            4.6



Fig. 6 Mach number (left), temperature ( middle) and pressure (right) distribution at the cross 
section at x = 580 mm (interface between intake and combustor simulation), respectively 


Table 3 Inflow conditions for the hydrogen injection into the
combustor (horizontal and vertical ports)

Hydrogen
Mach number $Ma_{H_2}$ (-)                2.15
Equivalence ratio $\phi$                  0.6
Mass flow $\dot{m}_{H_2}$ (g/s)           10.8
Total pressure $p_{H_2,0}$ (bar)          7.0
Static pressure $p_{H_2}$ (bar)           0.7
Total temperature $T_{H_2,0}$ (K)         293.0
Static temperature $T_{H_2}$ (K)          145.0
Wall temperature strut $T_{wall}$ (K)     293.0
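
The equivalence ratio quoted for the hydrogen injection can be related to the air mass flows of the intake simulation. The sketch below assumes that the hydrogen and air mass flows are given in the same units, that the air entering the combustor is the captured flow minus the suction II loss, and that air contains 23.2 % oxygen by mass; none of these assumptions is stated in the source.

```python
M_H2 = 2.016      # molar mass of H2 in g/mol
M_O2 = 31.998     # molar mass of O2 in g/mol
Y_O2_AIR = 0.232  # mass fraction of O2 in air (assumption)


def equivalence_ratio(m_fuel, m_air):
    """Global equivalence ratio of an H2/air stream (2 H2 + O2 -> 2 H2O)."""
    stoich_fuel_air = Y_O2_AIR * 2.0 * M_H2 / M_O2   # mass of H2 per mass of air
    return (m_fuel / m_air) / stoich_fuel_air


m_intake = 636.0    # captured air mass flow from the intake simulation
m_suction = 29.0    # air lost through suction II
loss_percent = 100.0 * m_suction / m_intake
phi = equivalence_ratio(10.8, m_intake - m_suction)
```

Under these assumptions the result is close to the quoted equivalence ratio of 0.6, and the suction loss reproduces the relative value given for suction II.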


5 Investigation of the Combustor 

The conditions for the hydrogen injection are given in Table 3; the conditions
for the incoming air flow at the interface are adopted from the intake
simulation. As a consequence of the suboptimal conditions at the interface
discussed at the end of Sect. 4 and the further decrease in temperature due to
the diverging combustion chamber, no self-ignition is achieved in the simulation
of the combustor. In order to enforce ignition with a design change still
feasible at this stage of the project, ignition wedges are integrated in the
model. Placed at the top and bottom walls of the combustor close to the trailing
edge of the strut injector, they are designed to initiate a shock system that
locally increases the temperature close to the injection and thus enables
ignition. Figure 7 sketches the geometry and the position of the ignition wedges.

Fig. 7 Sketch of the ignition wedges and their position in the model

Fig. 8 Pressure (top) and temperature (bottom) distribution in the central plane of the combustor
with ignition wedges (non-reactive simulation), respectively

The shock system generated by the ignition wedges is visible in the pressure and
temperature distributions in the central plane in Fig. 8 (non-reactive
simulation). Strengthened by a reflecting shock wave in the upper region of the
incoming flow, the shock wave originating from the front ramp of the top wedge is
more dominant than the one generated by the bottom wedge. The top wedge shock
crosses the cold injected hydrogen about 16 mm downstream of the injector's
trailing edge, thereby deflecting the fuel towards the lower half of the
combustion chamber. Further downstream the shock wave is reflected by the bottom
and top walls. The temperature distribution shows three local maxima above
1,500 K: at the front ramp of the top wedge, about 30 mm downstream of the
injector, and near the rear ramp of the bottom wedge. The upper part of Fig. 9
shows the H2 distribution for the reacting flow at several cross sections of the
model. The deflection of hydrogen by the shock waves is observed, as well as the
mixing of fuel and air supported by the strong vortices induced by the lobed
strut. At the hot spot close to the rear ramp of the bottom wedge, hydrogen and
air are sufficiently mixed to ignite. Accordingly we observe a significant
increase in the temperature and OH mass fraction distributions close to the
bottom wall (middle and bottom part of Fig. 9). Further downstream the flame
spreads over the whole cross section and a non-uniform, detached flame develops.
The simulation therefore gives evidence of the successful use of the ignition
wedges, while no thermal blockage is observed. Due to the short experimental
testing times the high temperatures close to the bottom wall are not considered
to be critical for the model. Numerical simulations with a reduced total
temperature (inflow conditions listed in column B of Table 1) indicate very low
temperatures and high Mach numbers at the interface between intake and combustor
($T_{av}$ = 852 K, $Ma_{av}$ = 3.14). In spite of these unfavorable conditions,
the ignition wedges again increase the temperature locally and enforce ignition.

Fig. 9 H2 (top), OH (middle) and temperature (bottom) distributions at cross sections of the
combustor (reactive simulation), respectively

Table 4 Computational performance of the simulations for the ITAM hypersonic wind tunnel
experiments

                            Intake    Combustor non-reactive   Combustor reactive
Number of volumes (Mio)     14.4      9.6                      9.6
Number of species           2         3                        9
Number of iterations        140,000   130,000                  80,000 (on basis of non-reactive solution)
Vector op. ratio (%)        98.16     98.68                    99.12
Average vector length       105.3     191.8                    212.5
MFLOPS                      4,329     4,912                    7,204
Quota peak perf. (%)        4.2       4.8                      7.0
Wall-clock time/iter. (s)   1.008     0.744                    1.538
Total CPU time (h)          627       430                      547


6 Performance Analysis 

The numerical investigations of the intake and the combustor have been performed
on the NEC SX-9 at the High Performance Computing Center Stuttgart using 1 node
with 16 CPUs. Table 4 gives an overview of the performance for the non-reactive
and reactive simulations. The average vector length is longer for simulations with
a high number of species. Accordingly the best vector performance is achieved
by the reactive simulation of the combustor using nine species (vector operation
ratio of 99.12 %). The low quotas of peak performance on the NEC SX-9
(4.2-7.0 %), particularly in comparison to the previous vector-processor based
HPC system, the NEC SX-8, have already been discussed in Ref. [15].
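
The entries of Table 4 are mutually consistent, which can be checked with a few lines of arithmetic. The sketch assumes that the MFLOPS values are per-CPU averages, that the total CPU time is the wall-clock time multiplied by the 16 CPUs used, and that the peak performance of one SX-9 CPU is 102.4 GFLOPS; the last value is an assumption that reproduces the quoted quotas and is not stated in the text.

```python
PEAK_MFLOPS_PER_CPU = 102400.0  # assumed SX-9 peak of 102.4 GFLOPS per CPU
N_CPUS = 16

# (iterations, wall-clock seconds per iteration, sustained MFLOPS) from Table 4
runs = {
    "intake": (140000, 1.008, 4329.0),
    "combustor non-reactive": (130000, 0.744, 4912.0),
    "combustor reactive": (80000, 1.538, 7204.0),
}


def cpu_hours(iterations, seconds_per_iteration, n_cpus=N_CPUS):
    """Total CPU time in hours: wall-clock time summed over all CPUs."""
    return iterations * seconds_per_iteration * n_cpus / 3600.0


def peak_quota(mflops):
    """Sustained performance as a percentage of the assumed per-CPU peak."""
    return 100.0 * mflops / PEAK_MFLOPS_PER_CPU


summary = {
    name: (round(cpu_hours(it, dt)), round(peak_quota(mf), 1))
    for name, (it, dt, mf) in runs.items()
}
```

With these assumptions the computed CPU hours and peak quotas reproduce the last rows of Table 4 for all three simulations.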

Table 5 lists the five most time consuming subroutines (using about 73.6% of 
the computational time) for the reactive simulation of the combustor and their per¬ 
formance data. All subroutines show high vector operation ratios (98.69-99.87 %), 
but the performance varies between 3,413 and 11,247 MFLOPS (3.3-11.0% peak 
performance) due to significant differences with regard to the bank conflicts. On 
the right hand side (RHS) of the set of equations subroutine PROP calculates the 
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Table 5 Performance data for the most time consuming subroutines for the reactive simulation of 
the combustor 


Subroutine 

Time 

(%) 

MFLOPS 

Vec. op. 
ratio (%) 

Av. vec. 
length 

Bank 

confl. 

Quota peak 
perf. (%) 

PROP 

20.8 

8,035 

98.69 

255.9 

1.74 

7.8 

LINE3D 

17.5 

3,413 

99.25 

223.3 

534.54 

3.3 

REACTION 

16.8 

11,247 

99.41 

240.0 

22.93 

11.0 

LFSWEEP 

9.3 

3,627 

99.87 

219.9 

647.65 

3.5 

UFSWEEP 

9.2 

3,635 

99.87 

219.7 

647.90 

3.5 


gas properties and subroutine REACTION the chemical source terms. As these
are local phenomena, both subroutines depend only on the data of the respective volume.
In contrast, the implicit left hand side (LHS) of the set of equations is solved
using an implicit lower-upper symmetric Gauss-Seidel solver (LU-SGS) [8, 10-12].
In order to resolve the data dependency on neighboring cells of the structured
i,j,k-ordered grid, the LHS subroutines LINE3D, UFSWEEP and LFSWEEP
are vectorized along hyperplanes defined by i + j + k = const, using indirect
addressing. This probably causes memory latencies and therefore more bank
conflicts per iteration (0.5345-0.6479 for the subroutines LINE3D, UFSWEEP and
LFSWEEP in comparison to only 0.0017-0.0229 for the subroutines PROP and
REACTION).
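
The hyperplane ordering described above can be illustrated with a minimal Python sketch. It is not taken from the solver discussed here: the function names, the flat cell numbering and the toy coupling factor are assumptions made purely for illustration. All cells with the same i + j + k depend only on cells of the previous hyperplane, so the loop over one index list carries no data dependency; the price is that cells are reached through an index list (indirect addressing) rather than with a constant stride.

```python
def hyperplane_lists(ni, nj, nk):
    """Group the flat cell indices of an ni x nj x nk grid by i + j + k = const."""
    planes = [[] for _ in range(ni + nj + nk - 2)]
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                planes[i + j + k].append((i * nj + j) * nk + k)
    return planes


def forward_sweep(rhs, ni, nj, nk, coupling=0.1):
    """Toy lower sweep: x[c] = rhs[c] + coupling * (already updated lower neighbours)."""
    x = [0.0] * (ni * nj * nk)
    strides = (nj * nk, nk, 1)  # flat-index offsets of the i-, j- and k-neighbours
    for cells in hyperplane_lists(ni, nj, nk):
        # Cells of one hyperplane only read values from the previous hyperplane,
        # so this inner loop is free of data dependencies (vectorizable).
        for c in cells:  # indirect addressing through the index list
            i, rem = divmod(c, nj * nk)
            j, k = divmod(rem, nk)
            acc = rhs[c]
            for idx, stride in zip((i, j, k), strides):
                if idx > 0:
                    acc += coupling * x[c - stride]
            x[c] = acc
    return x
```

The sweep gives the same result as a plain lexicographic i, j, k loop; only the traversal order changes, which is what makes the inner loop vectorizable.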


7 Conclusion 

A complete scramjet demonstrator model designated for wind tunnel testing at 
Mach 8 has been investigated numerically. A separation zone evolving from the 
interaction of the cowl shock and the boundary layer at the bottom wall of the 
model has been successfully eliminated by passive suction. The mass flow exiting 
the model through the suction slot has been evaluated. The first simulation of the 
combustor showed that the conditions were too cold for self-ignition. Therefore
ignition wedges have been integrated in the model. It has been shown for two
different test conditions that the evolving shock system increases the temperature
locally and enables self-ignition and a stable combustion.

The performance of the simulations of the intake and the combustor on the 
NEC SX-9 system has been analyzed. A good vector performance has been shown, 
especially for the reactive simulation of the combustor. Still, the high number of 
bank conflicts resulting from the indirect addressing of the LHS has to be further 
investigated and decreased. When the experimental data is fully analyzed, further 
simulations reproducing the exact testing conditions have to be performed in order 
to compare the experimental and the numerical results quantitatively. Furthermore it 
is intended to port the simulations of the scramjet demonstrator model to the massively
parallel scalar Cray XE6 system using more CPUs.


Application of the Unified Turbulent
Flame-Speed Closure (UTFC) Combustion 
Model to Numerical Computation of Turbulent 
Gas Flames 


Feichi Zhang, Peter Habisreuther, and Henning Bockhorn 


Abstract The current work presents the numerical computation of turbulent reactive
flow by means of three different classes of flame: a premixed, a non-premixed
and a partially premixed flame. The aim is to validate the unified turbulent
flame-speed closure (UTFC) combustion model developed at our institute. It is
based on the presumption that the entire turbulent flame can be viewed as a
collection of laminar premixed reaction zones (flamelets) with different mixing
ratios. The mixing process is controlled by the mixture fraction $\xi$ and the subsequent
chemical reaction by the progress variable $\theta$. The turbulent flame speed $S_t$ is
used to describe the flame/turbulence interaction as well as the finite rate reaction.
Complex chemistry and the pressure dependency (elevated pressure) of
the combustion process are included in the model as well. The applicability of the
model is explored by means of RANS (Reynolds averaged Navier-Stokes approach)
and LES (large eddy simulation) methodologies over a wide range of Damköhler
numbers Da. The results of all simulations show reasonably good agreement with
the experiments.


1 Introduction 

CFD (computational fluid dynamics) simulation of combustion systems is an 
important and reliable tool in designing and optimizing combustion devices like 
combustion engines or gas turbines. For such industrial flows with high Reynolds 
number Re, the resolution of all turbulent scales and the computation of all 
reacting species concentrations is not possible due to computational costs. For this 
reason, modeling is needed to simplify the underlying physics, namely: turbulence 
and combustion. Since combustion in industrial applications occurs in complex 3D flows
characterized by high turbulence intensities and thermal loads, the
turbulent length scales are of the order of the typical laminar flame thickness $\delta_l$,
suggesting that the flamelet assumptions [1] are in most cases violated. Proper
and comprehensive modeling therefore becomes important and is the objective
of the work. There are numerous modeling concepts for turbulent combustion,
which are usually subject to different physical restrictions; for example, some are only
applicable to premixed or non-premixed flames or to fast chemistry.

For turbulence modeling, the classical RANS method is fast, but provides
only information about the time mean variables of the flow and lacks transient
characteristics. On the other hand, LES offers the possibility to resolve the
unsteady flow structures down to a cut-off level and is therefore a good compromise
between computational costs and additional accuracy. The main difficulty for
LES modeling of turbulent combustion is that the theoretical flame thickness of
0.1-1 mm is generally smaller than the LES mesh size. This problem is avoided
in the current work by filtering the flame front. In doing so, the unresolved scalar
transport increases the flame thickness, through both turbulent and numerical
diffusion, so that the filtered or thickened flame can be resolved on the relatively
coarse mesh [2]. As a drawback, the response of the flame to turbulence fluctuations
becomes less sensitive.


2 Combustion Modelling 

The current work proposes a unified turbulent flame-speed closure (UTFC) combustion
model, which was developed by the authors, for the computation of turbulent
reactive flows. It is an extension of the reaction model by Schmid [3] to the modeling of
non-premixed and partially premixed combustion [4]. The basic idea of the concept
is to assume that the mixing of fuel and oxidizer takes place before the chemical
reaction, so that the entire turbulent flame can be considered as a collection of
distinct reaction zones with different, but individually fixed stoichiometries. The
mixing process is controlled by the mixture fraction $\xi$ and the subsequent chemical
reaction by the progress variable $\theta$.

To take the effect of turbulence on mixing into account, two transport equations
for $\tilde{\xi}$ and its variance $\widetilde{\xi''^2}$ are solved in addition to the basic conservation equations
and the transport equation for the reaction progress $\tilde{\theta}$ (the tilde denotes Favre-averaged
values). Pre-computed laminar flame structures are then averaged by a probability
density function (PDF) $P(\xi)$ determined by the mean value $\tilde{\xi}$ and its variance
$\widetilde{\xi''^2}$ with a fixed principal shape [1]. All averaged (i.e. integrated over the PDF)
quantities are then gathered in a look-up table with three input parameters:
the mixture fraction $\tilde{\xi}$, the variance of the mixture fraction $\widetilde{\xi''^2}$ and the progress
variable $\tilde{\theta}$.
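
The averaging step can be sketched as follows. The text only states that the PDF has a fixed principal shape; the beta distribution used here is a common choice for presumed-PDF models and is an assumption of this sketch, as are all function names and the simple midpoint quadrature. The sketch averages an arbitrary laminar profile over a PDF whose two parameters are fixed by the mean and the variance of the mixture fraction.

```python
import math


def beta_parameters(mean, variance):
    """Shape parameters (a, b) of a beta PDF with the given mean and variance."""
    if not 0.0 < mean < 1.0 or not 0.0 < variance < mean * (1.0 - mean):
        raise ValueError("need 0 < mean < 1 and 0 < variance < mean * (1 - mean)")
    g = mean * (1.0 - mean) / variance - 1.0
    return mean * g, (1.0 - mean) * g


def beta_pdf(x, a, b):
    """Beta probability density at 0 < x < 1, evaluated via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))


def pdf_average(profile, mean, variance, n=2000):
    """Average profile(xi) over the presumed PDF using a midpoint rule on (0, 1)."""
    a, b = beta_parameters(mean, variance)
    total = 0.0
    weight = 0.0
    for m in range(n):
        x = (m + 0.5) / n          # midpoints avoid the end points 0 and 1
        p = beta_pdf(x, a, b)
        total += profile(x) * p
        weight += p
    return total / weight          # normalise by the discrete weight sum
```

Calling `pdf_average` for every tabulated mean and variance, and for every laminar profile, would fill one slice of a table of the kind described above. For strongly bimodal cases (shape parameters below one) the midpoint rule used here would be too crude.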

Fig. 1 Connection of the look-up chemistry table (pre-processing) and the CFD solver in the UTFC approach


Figure 1 illustrates the interaction between the look-up chemistry table and
the CFD solver. The turbulent flow field and the pre-computed flame structures
are linked through the exchange of $\tilde{\xi}$, $\widetilde{\xi''^2}$ and $\tilde{\theta}$ (by multidimensional interpolation).
The reaction rate $\dot{\omega}_\theta$ is computed by means of the turbulent flame speed $S_t$ from
the turbulence parameters (turbulence intensity $u'$, turbulent time scale $\tau_t$ and
turbulent length scale $L_t$) and the thermophysical characteristics (laminar flame
speed $S_l$ and density of the unburned mixture $\rho_u$) as a function of the mixing ratio. The
species concentrations $\tilde{y}_i$ are needed to evaluate the enthalpy and the temperature,
respectively. A more detailed description of the reaction model can be found in [4].
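
The multidimensional interpolation mentioned above can be sketched as a trilinear look-up on a table with non-equidistant axes. This is an illustrative sketch only: the nested-list table layout, the function names and the clamping at the table boundaries are assumptions, not the implementation used in the solver.

```python
from bisect import bisect_right


def _bracket(axis, value):
    """Return (lower node index, weight of the upper node) on a sorted axis."""
    value = min(max(value, axis[0]), axis[-1])  # clamp to the table range
    lo = min(max(bisect_right(axis, value) - 1, 0), len(axis) - 2)
    w = (value - axis[lo]) / (axis[lo + 1] - axis[lo])
    return lo, w


def lookup(table, axes, point):
    """Trilinear interpolation of table[i][j][k] at point = (xi, variance, theta)."""
    brackets = [_bracket(a, p) for a, p in zip(axes, point)]
    result = 0.0
    for corner in range(8):                 # the 8 corners of the enclosing cell
        weight = 1.0
        idx = []
        for d, (lo, w) in enumerate(brackets):
            upper = (corner >> d) & 1       # 0: lower node, 1: upper node
            idx.append(lo + upper)
            weight *= w if upper else 1.0 - w
        result += weight * table[idx[0]][idx[1]][idx[2]]
    return result
```

Because the node positions are found by bisection, the axes do not need to be equidistant, which allows the table to be refined where the tabulated quantities change rapidly.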


3 Simulation Aspects 
3.1 Case Description 

To validate the proposed combustion model, it has been implemented into the open
source CFD code OpenFOAM [5], which employs the finite volume method with a
cell-centered storage arrangement. Subsequently, LES and RANS simulations of a
premixed, a partially premixed, and a non-premixed flame have been carried out in
sequence. All cases are operated at normal pressure $p_0 = 1$ bar.
































Fig. 2 Sketch of the matrix burner [6] (dimension labels in the sketch: 150 mm and 20 mm)

• Premixed combustion (Matrix Flame) [6]: A natural gas flame with an air
to fuel equivalence ratio of $\lambda = 1.75$ (lean premixed) and a thermal load of
$P_{th} = 275.6$ kW with the unburned mixture preheated to $T_0 = 400$ °C (see
Fig. 2). The Re based on the nozzle diameter ($D = 0.15$ m) and bulk velocity
($u_{bulk} = 20.7$ m/s) is $Re \approx 48{,}000$.

• Partially premixed combustion (Cabra Flame) [7]: A lifted jet flame with
a mixture of 33 % natural gas and 67 % air by volume (fuel-rich premixed),
issued from a central nozzle ($u_{bulk} = 100$ m/s, $D = 4.57$ mm, $T_0 = 320$ K, $Re = 28{,}000$)
into a large coflow of hot combustion products from many lean premixed
hydrogen/air flames ($u_{co} = 5.4$ m/s, $D_{co} = 210$ mm, $T_{co} = 1{,}350$ K), as shown
in Fig. 3.

• Non-premixed combustion (Sandia H3 Flame) [8]: Fuel jet of 50 % $H_2$ and
50 % $N_2$ by volume, which was injected into an air coflow at 0.2 m/s ($T_0 = 300$ K).
The corresponding Re based on the nozzle diameter ($D = 8$ mm) was 10,000.


3.2 Numerical Setups 

A fully implicit compressible formulation was used together with the pressure- 
implicit split-operator (PISO) for the pressure correction. The applied discretizations 

Fig. 3 Schematic of the lifted coflow burner [7] (labels in the sketch: lifted jet flame; vitiated
coflow, 2,200 flames; H2/air coflow and CH4/air jet)


for the momentum equations were based on a second order central difference
scheme for LES and a bounded linear scheme for RANS. To show the universality
of the approach, only the standard turbulence models, i.e. the Smagorinsky subgrid
scale (sgs) model [9] for LES and the standard $k$-$\varepsilon$ model [10] for RANS,
were applied for the turbulence modeling. The turbulent Schmidt and Prandtl
numbers were assumed to be constant, $Sc_t = Pr_t = 0.7$.

For all LES simulations, a turbulence inflow generator [11] was used to provide
transient correlated velocity fluctuations at the inlet boundary for each time step.
Moreover, the inflow generator has been applied in combination with a velocity
profile for the inlet boundary condition. The near-wall region was modelled by wall
functions. A non-reflecting boundary condition (NRBC) proposed by Poinsot and
Lele [12] has been applied for the compressible LES/RANS to avoid nonphysical
reflections of traveling pressure waves at the entrainment boundaries. $x$ and $r$
indicate the streamwise and radial axes of the cylindrical coordinate system.

Since all cases are round jet configurations, each computational
domain consists of a final part of the nozzle, which allows the turbulence to
develop in the burner and captures the velocity changes directly at the nozzle exit, and a
cylindrical domain ($L \times D$) downstream of the burner where mixing and combustion
take place. Because of the simple geometries of the computational domains, the
computational grids are built in a block-structured way from hexahedral elements
in order to achieve better accuracy. Table 1 shows some statistics of the
computational grids used. In each case, the LES and RANS simulations have been
computed on the same grid for comparison, although this is not necessary,
because only the time mean flow field is solved in RANS, where a much coarser grid
can be used.

Table 1 Mesh parameters and simulation setups used for the different flames

Case           $L \times D$      $N_{cell}$ (million)   $\Delta_{min}$ (mm)   $N_{inlet}$   $\Delta t$ (µs)   $t_{total}$ (s)   $N_{timestep}$
Matrix Flame   2.0 m x 1.6 m     6.3                    0.5                   3,000         5                 0.6               120,000
Cabra Flame    1.0 m x 0.2 m     4.3                    0.2                     900         0.5               0.1               200,000
H3 Flame       0.6 m x 0.4 m     2.5                    0.2                   1,200         2.5               0.3               120,000


All simulations are initialized with a uniformly distributed flow. The first stage is
to obtain a fully developed flow field without reaction. In the second step, the
non-reactive flows are ignited and develop further into a turbulent flame. Statistics
of the flow parameters are collected after a certain number of volumetric flow-through times in
this stage. The time step $\Delta t$ and the total running time $t_{total}$ for the LES simulations
are chosen to keep the CFL number smaller than 0.4 and to obtain a well averaged flow field
(see Table 1).
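
The relation between the entries of Table 1 can be checked with a few lines of Python. This is only a rough plausibility sketch: it uses the bulk velocity and the smallest cell size as characteristic values, which is an assumption, since the local CFL number in the solver depends on the local velocity and the local cell size. The function and variable names are hypothetical.

```python
def cfl_number(velocity, dt, dx):
    """Convective CFL number u * dt / dx."""
    return velocity * dt / dx


def n_time_steps(t_total, dt):
    """Number of time steps needed to cover t_total with a fixed step dt."""
    return round(t_total / dt)


# Values from the case description and Table 1 (SI units). The bulk velocity
# of the H3 flame is not quoted in the text, so it is left out of the CFL estimate.
cases = {
    "Matrix Flame": {"u": 20.7, "dx": 0.5e-3, "dt": 5.0e-6, "t_total": 0.6},
    "Cabra Flame": {"u": 100.0, "dx": 0.2e-3, "dt": 0.5e-6, "t_total": 0.1},
}

for name, c in cases.items():
    cfl = cfl_number(c["u"], c["dt"], c["dx"])
    steps = n_time_steps(c["t_total"], c["dt"])
    print(f"{name}: CFL (bulk estimate) = {cfl:.3f}, time steps = {steps}")
```

With these bulk values the estimate stays below the limit of 0.4, leaving some margin for the higher local velocities caused by gas expansion in the flame.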


3.3 Computational Effort 

Since the computational meshes used are fine (see Table 1) and additional
transport equations have to be solved to include chemical reactions, the in-house
Linux cluster (with a Gigabit Ethernet interconnect) at the Division of Combustion
Technology does not scale efficiently with an increasing number of
processors. For this reason, all LES/RANS simulations have been conducted on the
HP XC4000 cluster of the Steinbuch Centre for Computing (SCC) at the KIT [13],
whose large computational resources enable such a comprehensive comparison of
different models on such fine grids.

In our cases, the number of processors is typically chosen such that a
whole simulation runs as one job within the maximal running time of 4,320 min. For
LES simulations, where at least 100,000 time steps are necessary to obtain a well
averaged flow field, 128-196 processors, depending on the mesh size, have been
used to achieve a performance of approximately 3 s per time step. The
computational effort for RANS is much lower than that for LES, even though the
same mesh is used for both methods, because the LES has to be run for a long
time in order to reach statistical convergence. A performance study on parallelization
has been done for the finest grid with 6.3 million cells. Using one single node
(4 processors), the solver needs 76 s for one time step. Running the same case on
49 nodes (196 processors), a speed-up factor of 25 (about 3 s per time step) was achieved, which,
however, corresponds to an efficiency factor of only 55 %.
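
The speed-up and efficiency figures follow from the usual definitions, sketched here in Python with hypothetical function names. The single-node run is taken as the reference. With the rounded timings quoted above (76 s and about 3 s), the sketch gives an efficiency of roughly 0.52, slightly below the 55 % stated in the text; the stated value presumably rests on unrounded timings.

```python
def speedup(t_ref, t_par):
    """Speed-up of the parallel run relative to the reference run."""
    return t_ref / t_par


def parallel_efficiency(t_ref, n_ref, t_par, n_par):
    """Speed-up divided by the increase in resources (nodes or processors)."""
    return speedup(t_ref, t_par) / (n_par / n_ref)


s = speedup(76.0, 3.0)                        # 1 node vs. 49 nodes
e = parallel_efficiency(76.0, 1, 3.0, 49)
print(f"speed-up = {s:.1f}, efficiency = {e:.2f}")
```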


3.4 Chemistry Tabulation 

For combustion modeling using the UTFC concept (see Sect. 2), the CHEMKIN II 
program package [14] has been used for all cases to calculate the internal flame 

Fig. 4 Species profiles versus the flame coordinate (left) and the progress variable (right) at
$\lambda = 1.75$ and $T_0 = 400$ °C (Matrix Flame). The resulting laminar flame speed $S_l$ was 0.805 m/s
in this case


structure applying a detailed reaction mechanism: the GRI 3.0 (Gas Research
Institute) mechanism containing 325 reactions and 53 species [15] for methane/air
combustion and the reaction mechanism of Maas and Pope [16] containing 9 species
and 27 elementary reactions for combustion of hydrogen with air. In Fig. 4, mass
fraction profiles in the flame coordinate, $y_i(x)$, and in $\theta$-space, $y_i(\theta)$, used for the
Matrix Flame are illustrated: at the top for the main species ($O[y_i] = 10^{-2}$ to $10^{-1}$)
and at the bottom for some intermediate species ($O[y_i] = 10^{-3}$ to $10^{-4}$).

Figure 5 gives an overview of the tabulated laminar flame speed $S_l(\xi, \widetilde{\xi''^2})$ for the
Cabra Flame. The influence of the hot coflow on the flame speed was incorporated
in the construction of the look-up table: each of the flame computations was
conducted with the preheating temperature attained by mixing the fuel
jet with the hot coflow gas. Since $S_l$ depends strongly on the temperature, it
increases dramatically through the mixing process and reaches a maximal value of
3.1 m/s close to the stoichiometric mixture fraction of $\xi_{st} = 0.177$.

Figure 6 shows the integrated table for water vapour, $\tilde{y}_{H_2O}(\xi, \widetilde{\xi''^2}, \theta)$, used for
modeling of the Sandia H3 Flame. The surfaces from the lowest to the highest
position indicate values of $\theta$ from 0 to 1, i.e. from the unburnt to the completely
burnt state, with an increment of $\Delta\theta = 0.2$. Note that $\xi$ and $\widetilde{\xi''^2}$ are not spaced
equidistantly, as indicated by the surface patches.

Fig. 5 Chemistry look-up table for the laminar flame speed used for the Cabra Flame

Fig. 6 Look-up table $\tilde{y}_{H_2O}(\xi, \widetilde{\xi''^2}, \theta)$ used for combustion modeling of the
H3 Flame


4 Simulation Results 


4.1 Premixed Combustion Modeling 

In Fig. 7, slices (passing through the centre symmetry axis) of the computed time
mean temperature $T$, streamwise velocity $u$, and mass fraction of methane $y_{CH_4}$
from LES and RANS simulations are compared with the measured data. A conical
flame front stabilizing at the burner rim can be identified. It exhibits a
length of about 0.3 m. The RANS predicts a longer flame than the LES and the
experiment, which can be attributed to the $k$-$\varepsilon$ model: the turbulence
quantities are underestimated, leading to a smaller burning speed and a weakened
reaction rate $\dot{\omega}_\theta$. In contrast, the length and the thickness of the flame
provided by LES agree well with the experiment.

At the bottom left of Fig. 8, instantaneous contours of the streamwise velocity $\tilde{u}$
at the inlet BC provided by the inflow generator are illustrated; the generator produces intense
turbulence upstream of the burner exit. The influence of the inflow turbulence can

Fig. 7 Slices of computed and measured time mean variables: LES/Exp./RANS



also be identified in the contour plots of $\tilde{u}$ (top left of Fig. 8) and the vorticity
$\omega = |\mathrm{rot}\,\tilde{\mathbf{u}}|$ (bottom right of Fig. 8). The black curves in the $\tilde{u}$ and $T$ contours indicate
the flame surface defined by $\theta = 0.5$. The flame front is strongly corrugated and
torn by the intense turbulence from the inflow. However, the turbulence is attenuated on passing
through the flame surface due to the strong acceleration caused by the gas expansion
and the increased fluid viscosity. The hot products mix thereafter with incoming
cold air, which again enhances turbulence, as indicated by the vorticity field (the
white curve is the isoline of $T = 1{,}700$ K). The temperature increases rapidly in the
reaction zone (indicated by the isoline of $\theta = 0.5$) and then decreases by mixing
with the ambient air, which is shown at the top right of Fig. 8 in the contour plot of
instantaneous temperature.

Figure 9 shows a detailed comparison of the computed radial profiles of the time
mean values with the experiment for six axial positions (from left to right: the
temperature $T$, the streamwise velocity $u$, the radial velocity $v$ as well as the Reynolds
stresses $R_{xx} = \overline{u'u'}$ and $R_{yy} = \overline{v'v'}$). Note that only the resolved Reynolds stresses
are used here, because the sgs part is smaller in magnitude compared to the resolved

Fig. 8 Instantaneous flow and temperature fields derived from the LES computation

Fig. 9 Comparison of computed and measured radial profiles at different streamwise positions

Fig. 10 Effect of the inflow turbulence on the flame length


one due to usage of the fine grid (the sgs turbulent kinetic energy $k_{sgs} \propto \Delta^2$). The
overall agreement for the velocities and the temperature field is very good. Even
the Reynolds stresses are reproduced very well. This indicates that the turbulence
generator applied for computing the velocity components at the inlet works well.
However, the flame length from LES is slightly shorter than the measured one (see
also Fig. 7), which may be attributed to the overpredicted $R_{xx}$, for example, at the
position r = 0.3 m. In contrast, the Reynolds stress $R_{xx}$ in the streamwise
direction from RANS is largely underestimated, since the $k$-$\varepsilon$ model assumes
isotropy of turbulence, whereby the normal stresses are almost equal. This
consequently leads to an over-predicted flame length. Despite these differences, the
results are satisfactory, as the flame is preheated and operated at a very high turbulence
level, so that the interaction of the flame with the turbulence is difficult to capture
(Da < 1).
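
For reference, the resolved Reynolds stresses compared above are second moments of the velocity fluctuations about the time mean. A minimal sketch of how they could be evaluated from velocity samples at one monitoring point is given below; the function name and the plain time average (rather than a density-weighted one) are assumptions made for illustration.

```python
def reynolds_stresses(u_samples, v_samples):
    """Time means and resolved Reynolds stresses from equally spaced samples.

    Returns (u_mean, v_mean, r_xx, r_yy, r_xy) with r_xx = mean(u'u'),
    r_yy = mean(v'v') and r_xy = mean(u'v'), where u' = u - u_mean.
    """
    n = len(u_samples)
    if n == 0 or n != len(v_samples):
        raise ValueError("need two non-empty sample lists of equal length")
    u_mean = sum(u_samples) / n
    v_mean = sum(v_samples) / n
    r_xx = sum((u - u_mean) ** 2 for u in u_samples) / n
    r_yy = sum((v - v_mean) ** 2 for v in v_samples) / n
    r_xy = sum((u - u_mean) * (v - v_mean) for u, v in zip(u_samples, v_samples)) / n
    return u_mean, v_mean, r_xx, r_yy, r_xy
```

In a RANS computation with an isotropic eddy-viscosity model these normal stresses are not resolved individually, which is why the anisotropy seen in the LES and in the experiment cannot be reproduced there.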

It is important to note that the precise setting of the turbulence conditions at the
inlet boundary plays a decisive role for the current case. In order to demonstrate
this, another simulation of the matrix flame has been run with the same numerical
setup, but with a laminar inflow condition. As shown in Fig. 10, LES with
a steady inflow condition (right) exhibits a rather uniform and laminar flame front
due to the missing turbulence and predicts a significantly longer flame than LES
using a turbulent inlet BC (left). In the latter case, the enhanced turbulence leads to a
higher turbulent flame speed and a strongly increased reaction rate $\dot{\omega}_\theta$, so that the
flame is shorter.

Figure 11 demonstrates the influence of the mesh size on the resolved flame
brushes computed by LES, where the black curves illustrate isolines of $\theta = 0.1$ to
0.9 with an increment of $\Delta\theta = 0.1$. The mesh resolution evidently
has a strong impact on the turbulence/flame interaction, which can be identified
by the corrugated and torn flame surfaces. As indicated in Fig. 11, the flame
front provided by LES on the coarse mesh with 0.66 million elements is less
sensitive to turbulent fluctuations than those resulting from the finer meshes.

Fig. 11 Effect of mesh size on the resolved flame front from LES of the matrix flame


Also, LES on a finer mesh resolves a somewhat thinner and longer flame due
to the attenuated turbulent and numerical diffusion ($\nu_t, D_t \propto \Delta^2$ according to the
Smagorinsky model). However, it is noteworthy that the time mean properties, for
example, the length and thickness of the flame, should be much less dependent
on this effect, since these are governed by the mean flow rate (of fuel) and can be resolved
even using a coarse mesh.
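
The quadratic dependence on the filter width can be made explicit with the classical form of the Smagorinsky model, $\nu_t = (C_s \Delta)^2 |\tilde{S}|$. The sketch below uses this textbook form with a typical constant $C_s = 0.17$; both are assumptions of the illustration and need not match the exact formulation or constants in the CFD code. The turbulent Schmidt number of 0.7 is the value stated in Sect. 3.2.

```python
def smagorinsky_viscosity(delta, strain_rate_magnitude, c_s=0.17):
    """Classical Smagorinsky eddy viscosity nu_t = (C_s * delta)^2 * |S|."""
    return (c_s * delta) ** 2 * strain_rate_magnitude


def turbulent_diffusivity(nu_t, sc_t=0.7):
    """Turbulent scalar diffusivity D_t = nu_t / Sc_t."""
    return nu_t / sc_t


# Halving the filter width at a fixed resolved strain rate (illustrative values)
coarse = smagorinsky_viscosity(delta=1.0e-3, strain_rate_magnitude=1.0e3)
fine = smagorinsky_viscosity(delta=0.5e-3, strain_rate_magnitude=1.0e3)
print(f"nu_t ratio coarse/fine = {coarse / fine:.1f}")
```

At a fixed resolved strain rate, halving the filter width reduces the modelled viscosity and diffusivity by a factor of four, consistent with the thinner flame brush on the finer meshes.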


4.2 Modeling of the Lifted Partially Premixed Flame 

Figure 12 shows instantaneous contours of the temperature and the OH mass
fraction from the LES. A lifted flame with an inner and an outer reaction zone
can be identified clearly; these are sustained by premixed and non-premixed
combustion, respectively. The reaction zones due to these two types of flame cause a
strong inner and a weak outer formation of OH and heating of the flow. The diluted
mixture is pre-burned at a higher reaction rate by the premixed combustion; the
remaining fuel mixes thereafter with oxygen from the vitiated coflow and results
in a weak diffusion flame. Note that this diffusion flame is not located exactly at
the stoichiometric surface, because it is sustained by mixing of the flue gas from the
preceding premixed combustion with incoming oxygen from the coflow.

With respect to the underlying physics, the laminar flame speed increases by
mixing with the hot coflow (see Fig. 5). Despite that fact, the flame cannot
stabilize, i.e. the reaction cannot take place, directly after the burner mouth due to
the very small turbulent time scale or large strain generated at the shear layer, i.e.
$\dot{\omega}_\theta \to 0$ for $\tau_t \to 0$ or $Da \to 0$. Further downstream, the fuel jet mixes with the coflow
and becomes heated as well due to the high temperature of the coflow (1,350 K),
so that the flame speed increases strongly. At the same time, the flow is slowed
down by the turbulent diffusion process. As a result, there is a counteraction of

Fig. 12 Instantaneous contours of temperature and OH concentration from the LES computation:
the white curve indicates the flame surface defined by $\theta = 0.5$; the black curve denotes the isoline of the
stoichiometric mixture fraction $\xi_{st} = 0.177$; horizontal lines mark the positions of the measured radial
profiles at x/d = 15/30/40/50/70


increasing burning speed and stalled flow motion. As soon as the local flame speed
and the flow velocity are in balance at a certain downstream location, the flame
is stabilized there. All these phenomena of a lifted partially premixed flame are
accounted for by the UTFC combustion model, so that the stabilization mechanism
can be captured properly.

In Fig. 13, the mean and root mean square (rms) values derived from the LES
show quantitatively good agreement with the measurement. The small differences
between the LES and the measured data may be caused by the turbulence
conditions used at the inlet boundary, which were unknown from the experiment. Apart
from this uncertainty at the inlet BC, the diffusion flux from the RANS computation
is underestimated, which leads to higher $\xi$ at x/d = 15 and x/d = 30. This may
be caused by the standard $k$-$\varepsilon$ turbulence model due to its isotropic turbulence
assumption [10]. Consequently, the lift-off height is predicted too high in the RANS
computation, which can also be identified by the lower temperature level at x/d = 40
and x/d = 50 (dashed line).

Fig. 13 Comparison of measured and computed time mean and rms profiles of mixture fraction
(left) and temperature (right) at different streamwise positions as indicated in Fig. 12


Figure 14 shows a comparison of the centerline distributions of the LES/RANS
simulations with the experiment. The agreement between the predicted and
measured data is very good. $\xi$ from LES is slightly smaller than the measured
data, leading to a higher $T$, which may be attributed to the enhanced numerical
diffusion due to the relatively coarse grid used ($\nu_t \propto \Delta^2$). The RANS simulation
overpredicts the lift-off position, so that the rise in temperature occurs further
downstream. Despite these differences, the UTFC is a capable concept for
modeling partially premixed flames at the extinction limit and even provides
reasonably good solutions.

Fig. 14 Comparison of centerline profiles from measurement and CFD simulations

Fig. 15 Instantaneous contours of the temperature $T$, the turbulent flame speed $S_t$ and the subgrid
scale (sgs) turbulence intensity $u'$ from the LES simulation. The flame surface is indicated by
isolines of $\xi = \xi_{st} = 0.3$ in each subplot as well. The horizontal line marks the position of the
measured mean flame length at $L_f = 34\,D$


4.3 Non-premixed Combustion Modeling 

In Fig. 15, snapshots from the LES computation give an insight into the resolved
reactive flow field. The computed flame exhibits a length of $L_{LES} = 33\,D$, which is
equal to the predicted flame length from RANS, but slightly below the measured
one. This may be attributed to the enhanced numerical and turbulent diffusion
caused by the relatively coarse grid (of 2.5 million cells). The interaction of the flame
with the turbulent flow can be identified by the corrugated and disrupted flame
surface (denoted by isolines of $\xi = \xi_{st}$). As illustrated in Fig. 15 on the right ($u'$
contour), there are many vortices generated along the shear layer and the flame

Fig. 16 Comparison of the measured and computed time mean profiles of (from left to right)
mixture fraction, temperature, streamwise velocity and turbulent kinetic energy


surface, which cause large turbulence intensities and, thus, large flame speeds (see the $S_t$
contour) in these regions too. The temperature and the flame speed are strongly
correlated with the mixing field due to the pre-defined look-up tables. The maximum
value of $S_t$ is located slightly on the fuel-rich side, $\xi > 0.3$, because $H_2$ has a
much higher diffusion coefficient than air. In addition, regions of large flame speed
and temperature are located far downstream due to the large strain and the high
variance $\widetilde{\xi''^2}$ at the root of the flame and in the shear layer.

In Fig. 16, the calculated mean profiles show good agreement with the
experimental data [8]. The turbulent kinetic energy (in the last column) is over-predicted
close to the burner (x/D = 5) in both RANS and LES simulations, which may
be caused by the inlet BC used, since the turbulence properties were not exactly
known from the experiment. Also, the wall of the nozzle was assumed
to be infinitely thin, which causes a higher strain between the jet and the coflow. The
turbulent diffusion and dissipation in LES are somewhat overestimated due to the
relatively coarse grid used, leading to profiles of $\xi$ and $u$ (in the first and third
column) which lie somewhat below the measured data at x/D = 20. This explains
the slightly higher temperature (in the second column of Fig. 16) in this region.
Compared to the computations presented in the earlier work [4] using a coarser mesh
and otherwise the same conditions, the RANS simulations performed on different
grids (1 vs. 2.5 million cells) exhibited results of very similar quality. In contrast,
the flame length was considerably underestimated by LES on the coarse mesh in [4]
due to the enhanced numerical and turbulent diffusion. This indicates that the RANS
method is not as sensitive as LES to the mesh resolution and that the $k$-$\varepsilon$ model is well
suited for free-shear flows without large adverse pressure gradients [10].


5 Conclusions 

The scope of the work is to present numerical studies of turbulent combustion
by means of three different flames: a premixed, a partially premixed and a
non-premixed flame. The aim of the work is to validate the newly developed unified turbulent
flame-speed closure (UTFC) combustion model. The turbulent reactive flows have
been computed on relatively fine grids using compressible LES and RANS methods.
In the premixed jet flame case, the LES showed very good agreement with the
measurements of the mean flow variables and the Reynolds stresses. The inflow
turbulence condition was found to play a crucial role for the modelled reaction
rate and the resulting flame length, which is well reproduced by use of an inflow
generator. The influence of the mesh resolution on the interaction between turbulence
and flame front has been pointed out. The second case was represented by a
lifted partially premixed flame. The stabilization mechanism of the lifted flame
was captured properly by use of the UTFC model. The results from LES/RANS
simulations showed good agreement with the measured streamwise and radial
profiles for the mean and rms values of $\xi$ and $T$. For the non-premixed H3 Flame,
the predicted flame lengths from RANS and LES were very close to the measured
one. The computed mean radial profiles are very promising in comparison to the
experiment.

In fact, it is doubtful whether the differences between the simulation results and
the measured data are solely attributable to the drawbacks involved in the UTFC
approach. For example, accurate resolution of the mixing field plays a decisive
role for the computation of diffusion flames. However, this issue is mainly covered by
the turbulence modeling. In addition, the turbulence parameters in $S_t$ are derived
from the applied turbulence model as well, which influences the reaction rate to some
extent. Also, the mesh resolution remains a quality-determining factor for LES.
Nevertheless, there are so far only very few alternative models which could be
used to simulate such different flames using one single reaction model. There are
indeed models which could provide better results for some of the demonstrated
flames, but these will completely fail elsewhere. However, it is expected that a
refined mesh and more accurate specifications of the inlet BCs or, more importantly,
use of more sophisticated turbulence or mixing models will improve the results.

In summary, it can be stated that the UTFC model aims to keep track
of the characteristic rate-determining time scale of the combustion process in order to
describe the resulting heat release. The existence of such a time scale
in most combustion systems therefore makes this approach suitable for modeling
turbulent premixed, non-premixed and partially premixed combustion. Because of
the simplicity, the efficiency and the robustness of the proposed model, and in view
of the different cases already computed for a variety of systems, it is conceivable that
the model may be useful for practical applications. For this purpose, the influence
of the chemical kinetics at different operating conditions, like preheating, elevated
pressure, and use of different fuels, can be incorporated simply by modifying the burning
speed $S_l$.
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Lagrangian Approach for the Prediction 
of Slagging and Fouling in Pulverized Coal 
Combustion 

Olaf Lemp, Uwe Schnell, and Gunter Scheffknecht 


Abstract The deposition of ash particles in pulverized coal combustion provokes 
several problems for the operation of utility boilers. In order to avoid such problems, 
power plant operators have great interest in predicting the slagging and fouling 
tendency of the used fuel. 

For this purpose, an industrially highly relevant tool for the prediction of slagging 
and fouling which is applicable on high performance computing platforms such 
as vector machines or massively parallel systems has been developed. The model 
has been implemented into the CFD code AIOLOS and couples several relevant 
processes that are crucial for the build-up of depositions in power plants. It accounts 
for the flight of the ash particles through the furnace, the corresponding interaction 
with the flue gas and considers several deposition mechanisms on walls and tube 
bundles. In case of a predicted contact between a particle and a surface, the 
deposition rate is calculated based on the stickiness of the particle and the surface 
which is correlated with the melting behaviour. The model also takes into account 
the change of the heat transfer resistance of the already deposited particles and 
consequently the influence on the flue gas temperature. 

As an example, the model has been applied to a utility boiler with a thermal input 
of 730 MW (360 MW$_{el}$) in order to demonstrate the capability of this engineering 
tool. 

1 Introduction 

Most of the German power plants are fired either with imported bituminous coals or 
with local lignite coals. Bituminous coals are usually purchased on the world market 
depending on economic parameters. This can result in the combustion of a coal 
which has not been considered during the design process of the boiler. However, 
depending on the composition of the mineral matter of the coal, the slagging and 
fouling behaviour in the course of the combustion process may vary significantly. 

The impact of slagging and fouling on the operation of a power plant is 
manifold. Due to the insulating effect of the deposits on the heating tubes, the 
heat transfer from the flue gas to the water-steam cycle can deteriorate 
significantly, leading to an increase of the flue gas temperature at the furnace 
exit. Furthermore, the heat absorption in the convective heat exchangers may be 
reduced, possibly leading to elevated heat transfer in bundles further downstream 
and consequently to higher steam temperatures that may exceed the design 
parameters of the pressure parts. Moreover, ash shedding can clog the ash hopper 
or damage the furnace, as can the corrosive components of the ashes that attack 
the tube material. In addition, the constriction of the flue gas path due to 
deposits in the tube bundles leads to an increase of the flow resistance and to a 
decrease of the maximum flue gas load. These issues cause a significant 
deterioration of the power plant efficiency or even force the operators into 
unscheduled shutdowns. 

For those reasons, power plant operators have a great interest in the prediction 
of slagging and fouling. In the past, mainly slagging indices were used to predict 
the deposition behaviour of coals. Most of these indices were developed based on a 
specific boiler geometry and coal; therefore, a transfer to different systems has 
to be carried out carefully. As computational power has risen over the past 
decade, however, CFD has increasingly come into play as a relevant engineering 
tool to predict depositions during the design phase and the operation of boilers, 
or even for the analysis of damages. 


2 Description of the Modelling Framework 

2.1 Physical and Chemical Phenomena Causing Deposition 

Slagging defines the deposition in the furnace where radiative heat transfer is 
dominant and fouling takes place in the cooler convective heat transfer sections 
of the boiler. Both phenomena are caused by the inorganic components of the coal 
which are either bound to the organic fuel matrix or are trapped in it (included 
minerals), or are present as extraneous mineral particles (excluded minerals). The 
respective amount and chemical composition of them varies depending on the origin 
and geologic age of the coal. Two main processes are relevant for the slagging and 
fouling: On the one hand the condensation of gaseous inorganic components and 
on the other hand, the deposition of fly ash particles liberated from the coal matrix 
during pyrolysis and char combustion [1,2]. The condensation of species from the 
gaseous phase is usually the initial step of deposit build-up, leading to an initial 
layer which is characterized by its poor heat conductivity and low viscosity. As a 
consequence, this melt layer can “catch” bigger particles. Due to space limitations, 
only the modelling of the processes of the solid minerals will be described in the 
following. The computational framework of the deposition of gaseous minerals is 
presented in detail in [3]. 

Fig. 1 Flowchart framework 


2.2 Modelling Approach 

The model consists of the simulation of the turbulent flow, the temperature and the 
concentration field of the main flue gas species with an Eulerian approach whose 
results are set as boundary conditions for the calculation of the particle trajectories 
using a Lagrangian approach. In case of a predicted collision with a wall or tube, the 
sticking probability is estimated and a mass deposition rate can be calculated. With a 
time-dependent extrapolation the thickness of the deposition can be predicted which 
can be used for subsequent calculations with an adjusted heat transfer resistance. 
Figure 1 shows a very simplified flowchart of the program with the governing parts 
of the modelling approach. 


2.3 Simulation of Pulverized Coal Combustion 

The CFD code AIOLOS, developed by IFK, has been tailored for the simulation 
of turbulent reacting flows which are typical in pulverized coal combustion. Special 
emphasis has been put on the interaction between heat transfer, chemical reactions 
and turbulent flow. AIOLOS solves the transport equations for mass, species, 
momentum and enthalpy which are obtained from the respective conservation laws. 
Due to their similar appearance they can be described with a general transport 
equation which - for the direction j - may be expressed as: 

$$\frac{\partial (\rho \phi)}{\partial t} + \underbrace{\frac{\partial (\rho u_j \phi)}{\partial x_j}}_{\text{convective}} = \frac{\partial}{\partial x_j}\left(\Gamma_\phi \frac{\partial \phi}{\partial x_j}\right) + \underbrace{S_\phi}_{\text{sink/source}} \qquad (1)$$

with $\rho$, $t$, $u$, $x$, $\Gamma_\phi$, and $S_\phi$ denoting density, time, velocity, coordinate, 
diffusion coefficient, and source term, respectively. The system of equations is a 
set of strongly coupled non-linear differential equations which has to be solved 
numerically. Discretization in AIOLOS is carried out with the Finite Volume 
method. 
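
To illustrate what a finite-volume treatment of Eq. (1) involves, the following 
minimal sketch advances a single scalar in one dimension. It is not the AIOLOS 
discretization, which is not described here: the explicit time stepping, the 
first-order upwind convection, the central diffusion, the periodic domain, the 
constant coefficients and the function name `fv_step` are all assumptions made 
for illustration. 

```python
def fv_step(phi, rho, u, gamma, source, dx, dt):
    """One explicit finite-volume step of the 1D form of Eq. (1).

    Illustrative assumptions: constant density rho, constant velocity u >= 0,
    constant diffusion coefficient gamma, uniform cell size dx, periodic domain.
    """
    n = len(phi)
    new_phi = []
    for i in range(n):
        west = phi[i - 1]        # index -1 wraps to the last cell (periodic)
        east = phi[(i + 1) % n]
        # Convective fluxes through the east and west faces (upwind, u >= 0).
        conv_e = rho * u * phi[i]
        conv_w = rho * u * west
        # Diffusive fluxes through the faces (central differences).
        diff_e = gamma * (east - phi[i]) / dx
        diff_w = gamma * (phi[i] - west) / dx
        # Cell balance: net diffusive minus net convective flux, plus source.
        rhs = (-(conv_e - conv_w) + (diff_e - diff_w)) / dx + source[i]
        new_phi.append(phi[i] + dt * rhs / rho)
    return new_phi
```

Because every face flux leaves one cell and enters its neighbour, the sum of the 
cell values is unchanged by a step when the source term is zero, which is the 
conservation property that motivates the finite-volume formulation. 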

To cover all the relevant phenomena of pulverized coal combustion several 
submodels have been developed. Turbulence phenomena are described with the 
k-ε model. Pressure-velocity coupling is modelled by the SIMPLE method in 
combination with the interpolation scheme from Date for pressure correction [4]. 
Radiation is described with the Discrete Ordinate method [5]. Turbulence-chemistry 
interaction is modelled with the Eddy Dissipation Concept (EDC) proposed by 
Magnussen [6], and a global reaction scheme taking into account pyrolysis, char 
burnout and volatile combustion is applied. The modular structure of the code allows 
the continuous expansion, e.g. the simulation of new developments in combustion 
technology (e.g. oxy-coal combustion [7]) or the coupling of the furnace simulation 
with a highly detailed model of the water-steam cycle [8]. In addition, the code 
has been optimized to achieve high performance on parallel and vector computers. 
Further information concerning the AIOLOS code is given elsewhere [5,9,10]. 


2.4 Modelling of the Two-Phase Flow 

For the prediction of slagging and fouling, special emphasis has been put on the 
description of the two-phase flow. The interaction between dispersed coal particles 
in a continuous carrier gas flow can be described in two ways. The so-called 
Eulerian approach approximates the two-phase flow as quasi one-phase (continuum 
approximation). This is a common procedure for the simulation of pulverized coal 
combustion due to the low concentration of fuel particles in the gas phase. The 
balance is carried out in a fixed coordinate system, and the slip between gas and 
dispersed particle phase is neglected (see also [11]). In AIOLOS, the Eulerian 
approach is always used for the basic simulation, providing the velocity, temperature 
and main gas species concentration fields which are the boundary conditions for the 
prediction of the particle trajectories. 

Lagrangian Approach for the Calculation of the Particle Trajectories 

The space-resolved calculation of the deposition process is modelled with a 
Lagrangian approach. As the particles pass through a discrete number of turbulent 
eddies the respective interaction of the turbulence and the particle has to be taken 
into account. Therefore, the calculation of the particle trajectories is based on a 
stochastic separated flow (SSF) model which solves the equation of motion for 
each particle in a Lagrangian frame: 


$$\frac{d(m_P \, \mathbf{u}_P)}{dt} = \mathbf{F}_D + \mathbf{F}_g \qquad (2)$$

with the particle mass $m_P$, the particle velocity $\mathbf{u}_P$, the drag force $\mathbf{F}_D$ and 
the gravitational force $\mathbf{F}_g$. The drag force is defined as 


$$\mathbf{F}_D = C_D \, \frac{\mu \, Re_P}{2 \, d_P} \, A_P \, (\mathbf{u}_G - \mathbf{u}_P) \quad \text{with} \quad C_D = \frac{24}{Re_P} \left(1 + 0.15 \, Re_P^{0.687}\right) \qquad (3)$$

where $\mu$ is the molecular viscosity, $\mathbf{u}_{G,P}$ the instantaneous velocity of the gas phase 
or the particle, $C_D$ the drag coefficient - considering the ratio to the Stokes drag - 
and the Reynolds number $Re_P$ of the particle which is defined as 


$$Re_P = \frac{|\mathbf{u}_P - \mathbf{u}_G| \, d_P \, \rho_G}{\mu} \qquad (4)$$


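Hence the particle velocity can be calculated by solving Eq. (2) analytically: 

$$\mathbf{u}_P = \mathbf{u}_G + (\mathbf{u}_P^0 - \mathbf{u}_G) \exp\left(-\frac{\Delta t}{\tau_P}\right) + \tau_P \, \mathbf{g} \left(1 - \frac{\rho_G}{\rho_P}\right) \left[1 - \exp\left(-\frac{\Delta t}{\tau_P}\right)\right] \qquad (5)$$

with the relaxation time of the particle $\tau_P$ which is defined as: 

$$\tau_P = \frac{4 \, \rho_P \, d_P^2}{3 \, \mu \, C_D \, Re_P} \qquad (6)$$

and the interaction time $\Delta t$ which is defined as the minimum of either the transit 
time $t_T$ of a particle across a turbulent eddy or the eddy-life time $t_E$ [12] which are 
given as 

$$\Delta t = \min\left(t_E = \frac{L_e}{\sqrt{2k/3}}, \;\; t_T = -\tau_P \ln\left(1 - \frac{L_e}{\tau_P \, |\mathbf{u}_G - \mathbf{u}_P|}\right)\right) \qquad (7)$$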

These times are calculated assuming that the characteristic size of an eddy is the 
dissipation length scale $L_e$ [13] 

$$L_e = C_\mu^{3/4} \, \frac{k^{3/2}}{\varepsilon} \qquad (8)$$


Finally, the new particle position after the interaction time $\Delta t$ can be calculated: 

$$\mathbf{x} = \mathbf{x}^0 + \Delta t \, \frac{\mathbf{u}_P^0 + \mathbf{u}_P}{2} \qquad (9)$$

with $\mathbf{x}^0$ and $\mathbf{u}_P^0$ denoting the position and the velocity at the previously 
considered time step, respectively. 

Velocity fluctuations. In the SSF model, velocity fluctuations caused by the gas 
phase turbulence are assumed to be isotropic with a Gaussian probability density 
distribution having a standard deviation of $\sqrt{2k/3}$. The local distribution is 
randomly sampled when a particle enters an eddy to obtain the instantaneous 
velocity. Therefore, the mean gas velocity $\bar{\mathbf{u}}_G$ is superimposed by a fluctuating 
component $\mathbf{u}'_G$: 

$$\mathbf{u}_G = \bar{\mathbf{u}}_G + \mathbf{u}'_G \qquad (10)$$


It should also be mentioned that each Lagrangian particle in the SSF-model 
represents a fraction of the entire fuel mass stream. Therefore, a high number of 
particle trajectories has to be calculated in order to achieve a statistically valid result. 
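
The sequence of Eqs. (3)-(10) for a single eddy interaction can be sketched as 
follows. This is an illustrative reading of the equations, not the AIOLOS 
implementation. The function name `ssf_step`, its argument names, the value 
$C_\mu = 0.09$ (the usual k-ε constant, not stated in the text), the default gravity 
vector and the treatment of particles whose slip is too small to cross the eddy 
(the eddy lifetime is used) are assumptions. The gravity term is taken with the 
buoyancy factor $(1 - \rho_G/\rho_P)$. 

```python
import math
import random

C_MU = 0.09  # assumed k-epsilon model constant (not given in the text)


def ssf_step(x0, u_p0, u_g_mean, k, eps, d_p, rho_p, rho_g, mu,
             g=(0.0, 0.0, -9.81), rng=random):
    """Advance one particle through one eddy interaction (requires k > 0)."""
    # Eq. (10): add an isotropic Gaussian fluctuation with std sqrt(2k/3).
    sigma = math.sqrt(2.0 * k / 3.0)
    u_g = [um + rng.gauss(0.0, sigma) for um in u_g_mean]

    # Eq. (4): particle Reynolds number from the slip velocity magnitude.
    slip = math.sqrt(sum((ug - up) ** 2 for ug, up in zip(u_g, u_p0)))
    slip = max(slip, 1e-12)  # guard against division by zero
    re_p = slip * d_p * rho_g / mu

    # Eq. (3): drag coefficient; Eq. (6): particle relaxation time.
    c_d = 24.0 / re_p * (1.0 + 0.15 * re_p ** 0.687)
    tau_p = 4.0 * rho_p * d_p ** 2 / (3.0 * mu * c_d * re_p)

    # Eq. (8): eddy length scale; Eq. (7): interaction time.
    l_e = C_MU ** 0.75 * k ** 1.5 / eps
    t_e = l_e / sigma
    ratio = l_e / (tau_p * slip)
    # For ratio >= 1 the logarithm is undefined; the eddy lifetime governs.
    t_t = -tau_p * math.log(1.0 - ratio) if ratio < 1.0 else math.inf
    dt = min(t_e, t_t)

    # Eq. (5): analytical velocity update over the interaction time.
    decay = math.exp(-dt / tau_p)
    buoyancy = 1.0 - rho_g / rho_p
    u_p = [ug + (up0 - ug) * decay + tau_p * gi * buoyancy * (1.0 - decay)
           for ug, up0, gi in zip(u_g, u_p0, g)]

    # Eq. (9): new position from the average of old and new velocity.
    x = [xi + dt * 0.5 * (up0 + up) for xi, up0, up in zip(x0, u_p0, u_p)]
    return x, u_p, dt
```

Calling `ssf_step` repeatedly, with the local $k$, $\varepsilon$ and mean gas velocity looked 
up from the Eulerian solution at the current particle position, would trace one 
trajectory; repeating this for many parcels gives the statistics mentioned above. 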


Energy Balance 

An important aspect for predicting the stickiness of the particle is its temperature. 
In this framework the temperature is calculated from an energy balance considering 
the convective heat transfer $Q_c$, the radiative heat transfer between particle and gas 
$Q_r$ and the particle combustion heat release $Q_{com}$ [14]: 

$$m_P \, c_P \, \frac{dT_P}{dt} = Q_c + Q_r + Q_{com} \qquad (11)$$


2.5 Description of Tube Bundles in AIOLOS 

AIOLOS treats tube bundles as porous blocks. No special delimitation of the mesh 
is therefore considered. The corresponding finite volume cells are treated as internal 
computational cells which are expanded with additional sink and source terms for 
enthalpy, radiation intensity and momentum. In case of a contact of the flue gas 
with them, heat is transferred to the tubes and the flue gas experiences a pressure 
drop. The amount of the source and sink term depends on the packing of the tube 
bundle [15]. 

2.6 Calculation of the Deposition Rate 

In case of the collision of a particle with a surface the net deposition rate has to 
be calculated. For this purpose, a two-step model proposed by Pyykönen [16] is 
used. The model takes into account that in the Lagrangian framework, the particle 
cloud represents a fraction of the entire fuel stream. After each contact with a 
surface a fraction of this cloud can be deposited. Therefore, each Lagrangian particle 
is assigned a variable $C_{P,t}$ describing the concentration of its not-yet 
deposited mass. This value is set to 1 (= 100 %) at the beginning of the simulation 
and decreases according to the occurrence of deposition processes: 

$$C_{P,t} = C_{P,t-1} \cdot (1 - f_{imp} \, f_{stck}) \qquad (12)$$

$f_{imp}$ denotes the impact propensity and $f_{stck}$ the sticking propensity, which are 
calculated at each time step when a particle has been detected in a next-to-wall cell. 
This is interpreted by the model as a collision between particle and surface. 
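
As a minimal sketch of the bookkeeping in Eq. (12), the function below updates 
the not-yet-deposited fraction of one Lagrangian particle after a wall contact. 
The function name and the range check are illustrative assumptions. 

```python
def remaining_fraction(c_prev, f_imp, f_stck):
    """Eq. (12): not-yet-deposited mass fraction after one wall contact."""
    for name, value in (("f_imp", f_imp), ("f_stck", f_stck)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return c_prev * (1.0 - f_imp * f_stck)


# Hypothetical example: three successive contacts with a furnace wall
# (f_imp = 1) at a sticking propensity of 0.2.
c = 1.0
for _ in range(3):
    c = remaining_fraction(c, f_imp=1.0, f_stck=0.2)
```

After the three contacts in this hypothetical example the particle still carries 
$0.8^3 \approx 51\,\%$ of its initial mass; the remainder has been deposited. 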


Calculation of the Impact Propensity 

For the calculation of the impact propensity /i mp , three types of contacts between a 
particle and a surface are distinguished: 

• Contact particle vs. furnace wall 

• Contact particle vs. tube in the first row of a tube bundle 

• Contact particle vs. tube in a tube bundle 

In each case the driving force is different. In the first and the second case inertia 
is the main driving force. In the third case the main force is due to eddy impaction 
when a particle has not enough momentum to follow the streamlines 
(turbophoresis). The so-called thermophoresis, which describes the impaction 
tendency due to a temperature gradient in the boundary layer, is neglected. In the 
first case the impaction propensity is assumed to be equal to 1. In the second case 
the impaction propensity depends on the cross-sectional area of the tubes $A_{tube}$, 
the cross-sectional area of the furnace $A_{cross}$ and the parameter $\eta_I$ which considers 
the impaction of particles on cylinders in cross-flow [11]. The propensity is given as 



In the third case the packing of the tube bundles and boundary layer mechanisms are 
decisive. The model calculates the concentration of the corresponding resting time 
$t_{TB}$ in the next-to-wall cell and calculates a so-called deposition velocity $u_D$ caused 
by the turbulent impaction. The impaction propensity is calculated as follows: 

$$f_{imp} = \frac{u_D \cdot t_{TB} \cdot A_{TB}}{V_{TB}} \qquad (14)$$

The impaction efficiency is also a function of the tube bundle surface $A_{TB}$ per 
volume of the tube bank $V_{TB}$. A more detailed description of the models and their 
application can be found in [11,17]. 


Calculation of the Sticking Propensity 

The calculation of the stickiness of the particle and the surface is based on Walsh’s 
proposal [18] that considers the stickiness of the particle itself and the stickiness of 
the surface to predict the sticking propensity: 

$$f_{stck} = \underbrace{P_P(T_P)}_{\text{sticky particles}} + \underbrace{[1 - P_P(T_P)] \cdot P_S(T_S)}_{\text{sticky surface}} \qquad (15)$$

$P_{P,S}(T_{P,S})$ denotes the sticking propensity of the particle and surface as a function 
of the corresponding temperature. Walsh proposed to predict the sticking propensity 
by calculating the actual viscosity of the particle and the surface and relating it to a 
so-called reference viscosity, which is defined as the viscosity below which each 
particle would stick completely. The calculation of the viscosity was mostly 
carried out with empirical correlations based on measurements of glass-rich melts. 
As the measurement of the reference viscosity is quite time-consuming, and the 
approaches for the calculation of the viscosity have been found by the authors 
to be valid only for specific ash compositions [19], alternative models had to be 
developed. Therefore, an approach which was derived for the deposition of ashes 
from bio-fuels has been transferred to coal combustion. Frandsen [20] proposed to 
estimate the sticking propensity based on the melting behaviour of the fuel. The 
model assumes that if the melted fraction is 

• Less than 10 % the particle or surface is non-sticky, 

• Between 10 and 70 % a linear approach is assumed, 

• Higher than 70 % the particle or surface is fully sticky. 
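
The three rules above, combined with Eq. (15), can be sketched as follows. The 
function names are illustrative, and the linear ramp from 0 at a melt fraction of 
10 % to 1 at 70 % is an assumption about what the "linear approach" means; the 
melt fractions themselves would come from an external equilibrium calculation. 

```python
def melt_stickiness(melt_fraction):
    """Sticking propensity of a particle or surface from its melt fraction (0..1).

    Assumed ramp: 0 up to 10 % melt, 1 from 70 % melt, linear in between.
    """
    if melt_fraction <= 0.10:
        return 0.0
    if melt_fraction >= 0.70:
        return 1.0
    return (melt_fraction - 0.10) / 0.60


def sticking_propensity(p_particle, p_surface):
    """Eq. (15): a sticky particle always sticks; otherwise the surface decides."""
    return p_particle + (1.0 - p_particle) * p_surface


# Hypothetical example: particle with 40 % melt hitting a surface with 25 % melt.
f_stck = sticking_propensity(melt_stickiness(0.40), melt_stickiness(0.25))
```

Eq. (15) has the form of the probability that at least one of two independent 
events occurs, so the result stays between 0 and 1 and equals 1 as soon as either 
the particle or the surface is fully sticky. 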

The melt fraction of the coal as a function of the temperature can be calculated 
with the software FactSage [21]. The calculation relies on the mineral content of 
the fuel assuming a thermodynamical equilibrium. The basis of this software is a 
huge database with thermochemical data from measurements of mostly binary and 
ternary systems which are applied for the multi-component system that represents 
ash particles [22]. 


Calculation of the Net Mass Deposition Rate 

Finally, the net deposited mass $\dot{m}_{dep,net}$ per square meter and second is 
calculated by 

$$\dot{m}_{dep,net} = \dot{m}_{Ash} \cdot f_{stck} \cdot f_{imp} - k_e \, m_{Ash} \qquad (16)$$

with $\dot{m}_{Ash}$ denoting the arriving mass flow, $m_{Ash}$ the specific mass of the already 
deposited ash and $k_e$ an erosion coefficient which takes into account several 
shedding mechanisms (e.g. soot blowing). This coefficient has not yet been 
extensively investigated and information in the literature is scarce, but it is the 
focus of current experimental work at IFK. 
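
A small sketch may help to show the role of the erosion term in Eq. (16). It 
assumes that the specific deposit mass grows at the net rate, 
$dm_{Ash}/dt = \dot{m}_{dep,net}$, and integrates this with an explicit Euler scheme; this 
closure, the function names and all numerical values are illustrative 
assumptions, not part of the described model. 

```python
def net_deposition_rate(m_dot_ash, f_stck, f_imp, k_e, m_ash):
    """Eq. (16): net deposited mass per unit area and time."""
    return m_dot_ash * f_stck * f_imp - k_e * m_ash


def integrate_deposit(m_dot_ash, f_stck, f_imp, k_e, t_end, dt, m_ash0=0.0):
    """Explicit Euler integration of the deposit mass per unit area.

    Assumes dm_ash/dt equals the net rate of Eq. (16) and constant inputs.
    """
    m_ash = m_ash0
    for _ in range(int(round(t_end / dt))):
        m_ash += dt * net_deposition_rate(m_dot_ash, f_stck, f_imp, k_e, m_ash)
    return m_ash


# Hypothetical values: arriving flux 1e-3 kg/(m^2 s), erosion coefficient 1e-3 1/s.
m_final = integrate_deposit(1e-3, f_stck=0.5, f_imp=1.0, k_e=1e-3,
                            t_end=20000.0, dt=1.0)
```

Under these assumptions the deposit does not grow without bound but approaches 
the value $\dot{m}_{Ash} f_{stck} f_{imp} / k_e$ at which deposition and shedding balance, 
here 0.5 kg/m². 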


2.7 Slag Conductivity 

Special emphasis has also been put on the description of the thermal conductivity of 
already deposited particles. Two different models based on the recommendations 
of Zbogar [23] have been implemented. The conductivity of sintered layers is 
calculated with the model of Hadley, that of porous deposits with the model of 
Yagi/Kunii. In both cases, porosity is the determining resistance for the conduction 
of heat between the deposition surface and the heating surface. The porosity $\varepsilon$ is 
calculated in the following way: 

$$\varepsilon = 1 - [(1 - \varepsilon_0) + Y_{Melt} (1 - \varepsilon_0)] \qquad (17)$$

where $\varepsilon_0$ denotes the initial porosity, which depends on the characteristics of the 
deposit (sulfatic or silicatic), and $Y_{Melt}$ is the calculated melt fraction [24]. 

Table 1 Boundary conditions of the evaluation case 

Inlet flows (kg/s) 
Fuel     OFA    Primary   Secondary   Sidewall   Hopper 
24.46    69     57.89     132         30         1.12 


3 Simulation Results 

3.1 Simulation of a Utility Boiler 

In this section results of the simulation of a power plant with a thermal input of 
750 MW are presented. The firing system of the utility boiler consists of 12 air- 
staged swirl burners on three levels. The burners are arranged asymmetrically on 
each wall of the furnace leading to a rotational flow of the flame. Each burner level 
is additionally provided with two air nozzles to avoid oxygen lean areas next to the 
walls that can provoke corrosion of the walls. The over fire air (OFA) is installed 
at an elevation of 28.4 m. The overall dimensions of the boiler firing imported 
bituminous coal are 12 x 12 x 68 m. The operating conditions are summarized in 
Table 1. 

For the simulation with AIOLOS a three-dimensional CFD model consisting of 18 
domains is used. Each burner is individually discretized, as are the near-burner 
region of each of the four walls, one main region and the convective area. The tube 
bundles are modelled with porous blocks. The CFD model with a total cell number 
of around 5.8 million is outlined in Fig. 2 (left). Several validation simulations with 
AIOLOS have already been made for the considered utility boiler (see [5,25]). As an 
example, an isosurface at a temperature of 1,400 °C is shown in Fig. 2 (right) to 
display the typical rotational shape of the flame. 

Fig. 2 CFD model of the utility boiler (left; dimensions in m) and isosurface of 
temperature 1,400 °C (right) 


Table 2 Properties of Pittsburgh #8 coal 

         Proximate analysis (wt.-%)              Ultimate analysis (wt.-%) 
         Cfix    Volatiles  Moisture  Ash        C       H      N      S      O 
ar a     53.29   31.68      9.27      12.14      67.20   4.28   1.35   0.88   4.88 
daf b    67.80   40.31      -         -          85.11   5.45   1.72   1.12   6.21 

Ash oxide analysis 
SiO2    Al2O3   CaO    Fe2O3   MgO    Na2O   K2O    TiO2   SO3    P2O5 
63.90   24.12   0.44   1.95    0.32   0.29   2.27   0.41   3.63   0.16 

a As received 
b Dry, ash-free basis 


In the presented case a North American coal (Pittsburgh #8 coal) is fired. The fuel 
properties are summarized in Table 2. For the calculation of the melting behaviour 
the “FactSage” software is used as mentioned above. Figure 3 shows the respective 
results of the calculation. The green curve represents the amount of melted phases 
relative to the sum of the solid and the slag phase, the black line describes 
the sticking propensity. In this case, a melt fraction of 10 % is reached at a 
temperature of 1,050 °C and the particle and the surface start to become sticky. 
Between 1,050 and 1,430 °C (70 %) the following linear correlation is assumed for 
the stickiness: 

$$f_{stck} = 0.0026 \cdot T_{P/S} - 2.7632 \qquad (18)$$

Above this temperature full stickiness is assumed for the surface and the particle 
which implies that the sticking propensity equals 1. This calculation is carried out 
before the CFD simulation. 
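
As a short check, the coefficients of Eq. (18) can be recovered from the two 
temperatures given above, assuming that the sticking propensity rises linearly 
from 0 at 1,050 °C to 1 at 1,430 °C (with $T$ in °C; the symbols $a$ and $b$ are 
introduced here only for this check): 

$$a = \frac{1 - 0}{1430 - 1050} = \frac{1}{380} \approx 0.0026, \qquad b = -a \cdot 1050 = -\frac{1050}{380} \approx -2.7632$$

$$f_{stck}(1050) = \frac{1050 - 1050}{380} = 0, \qquad f_{stck}(1430) = \frac{1430 - 1050}{380} = 1$$

The slope in Eq. (18) is rounded to two significant digits; evaluating the printed 
coefficients at 1,430 °C gives about 0.95 rather than exactly 1, so the unrounded 
slope $1/380$ would be the natural choice in an implementation. 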

The presented results correspond to a theoretical study of two cases. In both 
cases the operating conditions are equal; only the initial condition of the deposition 
thickness on the surfaces is varied. In the first case an almost clean furnace was 
assumed, and in the second case a typical condition at the end of the operating 
time was assumed. Figure 4 shows the predicted deposition rate at the rear wall of 
the furnace. The rates are normalized by the maximum calculated deposition rate 
in the second case. The deposition rate is clearly higher in the second case than in 
the first. Due to the deteriorated heat transfer the flue gas temperature rises, and 
the resulting higher surface temperatures lead to an increased sticking propensity. 
A similar result can also be observed in the tube bundles. The results for the first 
two bundles are shown in Fig. 5 (in both cases a cut through the centre of the 
bundle is shown). This calculation provides valuable information about the 
positions with high deposition rates, which can be used for the optimization of 
soot blower operation. The investigated utility boiler has no notable slagging and 
fouling problems; only around the burner quarl are high deposition rates reported 
by the operators. This can also be observed in the presented results, especially on 
the second burner level. 

Fig. 4 Deposition rate on rear wall, comparison of clean boiler (left) with slagged boiler (right) 

Fig. 5 Deposition rate in tube bundles, comparison of clean boiler (left) with slagged boiler (right) 


3.2 Performance 

The presented case consists of almost 6 million computational cells and a total of 
18 domains. The calculations were performed on the platforms provided by the High 
Performance Computing Center Stuttgart (HLRS) and were executed on two vector 
platforms: 8 CPUs were used on the NEC-SX8 and 16 CPUs on the NEC-SX9. An 
average vector length of 222 (SX9: 196) has been achieved. The computational 
performance is about 1.6 GFLOPS (SX9: 0.9 GFLOPS) and the memory requirement 
is 11 GB. Each simulation requires about 120,000 iterations 
until convergence which results in a total elapsed time for a full simulation of about 
120 h (SX9: 64 h). The lower performance compared to previous simulations of 
pulverized coal combustion with AIOLOS [7, 8] is attributed to the multi-domain 
grid and the high number of overlapping cells in this specific case. Nevertheless, 
investigations to increase the performance of this particular case have been initiated 
showing promising results. The calculation of the particle trajectories is carried out 
on a massively parallel computer. 500,000 particle trajectories and the consideration 
of 3,000 time steps take a computing time of about 2 h. 
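
The figures quoted above can be put on a common footing with a few lines of 
arithmetic. The sketch below only rearranges the numbers given in this section; 
the variable names are illustrative. 

```python
SECONDS_PER_HOUR = 3600.0

iterations = 120_000
elapsed_hours = {"SX8": 120.0, "SX9": 64.0}  # total elapsed time per simulation
cpus = {"SX8": 8, "SX9": 16}                 # CPUs used on each platform

# Wall-clock seconds per iteration and CPU hours per simulation.
seconds_per_iteration = {
    m: elapsed_hours[m] * SECONDS_PER_HOUR / iterations for m in elapsed_hours
}
cpu_hours = {m: elapsed_hours[m] * cpus[m] for m in elapsed_hours}

# Lagrangian stage: 500,000 trajectories with 3,000 time steps in about 2 h.
particle_steps_per_second = 500_000 * 3_000 / (2.0 * SECONDS_PER_HOUR)

print(seconds_per_iteration, cpu_hours, round(particle_steps_per_second))
```

This gives about 3.6 s per iteration on the SX8 and 1.9 s on the SX9, at a similar 
cost of roughly 1,000 CPU hours per simulation on either machine, and about 
2 × 10^5 particle time steps per second for the trajectory calculation. 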

Current development work focuses in particular on the optimization of the code 
for the new Cray XE6 “Hermit” platform at HLRS. 


4 Conclusions and Outlook 


The modelling framework of a highly useful industrial application for the prediction 
of slagging and fouling in coal-fired utility boilers has been presented. The model 
couples a huge number of physical and chemical processes that determine the 
deposition of gaseous and solid components of the fuel on heat transfer walls 
and tube bundles. Due to the lack of experimental data a full validation has not 
been carried out yet. However, the first simulation results provided plausible and 
promising results. The software tool can be applied during the design process, for 
the retrospective investigation of damage events as a consequence of slagging and 
fouling phenomena, and for the optimization of the boiler performance (e.g. soot 
blower operation). It is applicable to a wide range of coals and can consider the 
influence of different operating conditions on the deposition tendency. Due to the 
multitude of processes, the use of high-performance computing (HPC) is a must. 
The influence of the flue gas on the water-steam cycle is also currently a focus 
of the development. Therefore, the coupling of the tool to IFK’s water-steam code 
[8] will be continued. 


Part IV 

Computational Fluid Dynamics 

Prof. Dr.-Ing. Siegfried Wagner 


The following chapter presents the selection of papers submitted to HLRS 
for the Review Workshop 2012 that showed a very high scientific standard and 
at the same time demonstrated the indispensable use of high performance computers 
(HPCs) for the solution of the problem. 

It is very important that users together with members of the HLRS further 
develop highly sophisticated numerical methods and advanced algorithms. 
Remarkable progress in this respect is demonstrated by the present contributions 
over the last year. Some of the users ran their jobs on the NEC SX-9 and on the NEC 
Nehalem Cluster, whereas others had already switched to the new platform of HLRS, 
i.e. the Cray XE6 “Hermit”. Thus, the number of cores employed by several users 
on “Hermit” has increased remarkably compared to the situation on the NEC SX-9. 
However, the maximum number of cores used is far below the 113,664 cores offered 
by “Hermit”. Besides changing from the vector computer NEC SX-9 to the massively 
parallel platform Cray XE6, the users moved more and more to highly sophisticated 
methods, e.g. the Discontinuous Galerkin (DG) method. The DG method is especially 
suited to massively parallel platforms and offers higher order methods with an 
acceptable programming effort. 

As an example, Christoph Altmann et al. applied a DG method and an unstructured 
grid for their numerical simulations using up to 4,096 cores. They plan to 
extend their code to be able to run reliable simulations on O(10,000) and even on 
O(100,000) processors. Extending the codes so that they can use a higher number 
of cores than so far is and should be a central concern of all users of massively 
parallel high performance computers. 

Rebecca Busch et al. also used a DG method of up to fourth order. In addition, they 
performed the evaluation of the surface and volume integrals in small groups which 
they called patches. The data in the patches are stored in a memory-efficient way, 
i.e. the data can be transferred very quickly from the memory to the CPU. This 
procedure enabled a reduction of computation time of up to 84.4 %. 

Groskopf and Kloker studied the effects of an oblique roughness on the hypersonic 
boundary layer by Direct Numerical Simulation (DNS) using up to tenth order 
explicit finite differences (EFDs). They ported their NS3D code from the NEC SX-9 
to the Cray XE6 “Hermit” and implemented explicit finite differences to speed up 
the computation of derivatives for a large number of domains. That way they could 
significantly reduce the amount of necessary MPI communication and increase the 
performance compared to the more accurate and more expensive compact finite 
differences they had used formerly. They demonstrated a remarkable speed-up 
from 14 to 4,096 MPI processes and even gained a superlinear speed-up due to 
cache effects when using EFDs. 

Breuer and Alletto investigated the effect of wall roughness seen by particles in 
turbulent channel and pipe flows. They showed that LES in combination with a 
Lagrangian treatment of the disperse phase can considerably improve the particle 
statistics in turbulent channel and pipe flow by incorporating a recently published 
wall roughness model for the solid phase. 

Martin Konopka et al. used LES to study supersonic film cooling at incident 
shock-wave interaction. They reached a performance of 285 GFlop/s on 16 CPUs of 
the vector computer NEC SX-9. Their code is completely vectorized with a vector 
operation ratio higher than 99 %. 

Kloren and Laurien investigated stratified and non-stratified mixing flows and 
also used Large Eddy Simulation (LES) for their studies in order to reduce the 
computational effort compared to DNS but to increase the accuracy compared to 
RANS. An interesting result has been that the speed-up with four cores per node is 
remarkably better than with eight cores per node. 

Jastrow and Magagnato simulated compressible viscous flow and solved the 
compressible Navier-Stokes equations in the outer flow field using an approximate 
Riemann solver while using the simplified boundary-layer equations near the wall. 
The specialty of the code of Jastrow and Magagnato is the Cartesian-grid immersed 
boundary method (IBM) that offers an interesting approach to realize automatic 
Cartesian mesh generation. They used 1,344 Opteron cores of the Cray XE6 and 
consumed about 24 h computational time for one unsteady calculation in three 
dimensions using about ten million points. 

K. Nübler et al. examined the shock control bump flow physics by numerical 
simulation on the NEC Nehalem Cluster using the well-known RANS code FLOWer 
and by wind tunnel measurements. In order to identify a good bump design they 
performed automated shape optimizations and found out that around 80 consecutive 
computations are necessary to define the optimum shape. 

Starzmann et al. also solved the Reynolds-averaged Navier-Stokes (RANS)
equations for their numerical simulations and simulated two-phase flows by the 
ANSYS Code CFX on the NEC Nehalem Cluster. They needed a simulation time of 
only 2 weeks on the high performance computing cluster for the treatment of their 
project whereas on a simple PC cluster the pure numerical calculations would have 
run between 2 and 3 months. 

The project of Markus Wittmann et al. aimed at improving the scalable domain
decomposition and partitioning scheme for the Lattice Boltzmann flow solver
ILBDC. The authors did not use any HLRS computers but other national
and international platforms. An important aspect of their contribution is the
development and evaluation of GPGPU-enabled codes. Larger scaling and benchmarking
tests were carried out on NERSC’s Dirac system and on the Japanese Tsubame
2 platform. In addition, some specific results deal with potential alternatives to
MPI, especially MPC and Co-Array FORTRAN. They did not observe a substantial
improvement compared to plain MPI usage; this negative result is nevertheless
important information for HPC users in general.


Discontinuous Galerkin for High Performance 
Computational Fluid Dynamics 


Christoph Altmann, Andrea Beck, Andreas Birkefeld, Gregor Gassner, 
Florian Hindenlang, Claus-Dieter Munz, and Marc Staudenmaier 


Abstract In this report we present selected simulations performed on the HLRS 
clusters. Our simulation framework is based on the discontinuous Galerkin method 
and consists of four different codes, each of which is developed with a distinct 
focus. All of those codes are written with a special emphasis on (MPI) based high 
performance computing. Results of compressible flow simulations such as flow past 
a sphere, compressible jet flow and isotropic homogeneous turbulence as well as an 
application of our aeroacoustic framework are reported. All simulations are typically 
performed on hundreds and thousands of CPU cores. 


1 Introduction 

The central goal of our research is the development of high order discretization
schemes for a wide range of continuum mechanics problems with a special
emphasis on fluid dynamics. Therein, the main research focus lies on the class
of Discontinuous Galerkin (DG) schemes. The in-house simulation framework
consists of four different discontinuous Galerkin based codes with different features,
such as structured/unstructured grids, non-conforming grids (h-adaptation),
non-conforming approximation spaces (p-adaptation), high order grids (curved) for
approximation of complex geometries, modal and nodal hybrid finite elements and
spectral elements with either Legendre-Gauss or Legendre-Gauss-Lobatto nodes.
The time discretization is an important aspect in our research and plays a major
role in the computing performance of the resulting method. The simulation framework
includes standard explicit integrators such as Runge-Kutta, a time accurate
local time stepping scheme developed in-house and an implicit time discretization
based on implicit Runge-Kutta methods. The general layout of the framework
outsources all aspects of a specific physical problem to be solved (e.g. fluid
dynamics) to an encapsulated module separated from the main code by clearly
defined interfaces. Thus, by exchanging this physical problem definition module,
the framework is able to solve various partial differential equations such as the
compressible Navier-Stokes equations (fluid dynamics), the linearized Euler equations
(aeroacoustics), Maxwell's equations (electrodynamics) and the magnetohydrodynamic
equations (plasma simulation). One of the major foci in the group is the simulation
of unsteady compressible turbulence in the context of Large Eddy Simulation (LES)
and Direct Numerical Simulation (DNS). Due to the occurrence of multiple spatial
and temporal scales in such problems and the resulting high demand in resolution
for both space and time, a high performance computing framework is mandatory.


2 Description of Methods and Algorithms 

Discontinuous Galerkin (DG) schemes may be considered a combination of finite 
volume (FV) and finite element (FE) schemes. While the approximate solution is a
continuous polynomial in every grid cell, discontinuities at the grid cell interfaces
are allowed, which enables the resolution of strong gradients. The jumps on the cell
interfaces are resolved by Riemann solver techniques, already well-known from the 
finite volume community. Due to their interior grid cell resolution with high order 
polynomials, the DG schemes can use coarser grids. The main advantage of DG 
schemes compared to other high order schemes (Finite Differences, Reconstructed 
FV) is that the high order accuracy is preserved even on distorted and irregular grids. 
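
To illustrate how a Riemann solver resolves the jump at a cell interface, the
following minimal sketch evaluates a local Lax-Friedrichs (Rusanov) numerical flux
for the scalar Burgers equation. The choice of equation and flux as well as the names
`burgers_flux` and `rusanov_flux` are assumptions made for illustration; the codes
described in this report solve the compressible Navier-Stokes equations and may use
different Riemann solvers.

```python
def burgers_flux(u):
    """Physical flux f(u) = u^2 / 2 of the inviscid Burgers equation."""
    return 0.5 * u * u


def rusanov_flux(u_left, u_right):
    """Local Lax-Friedrichs (Rusanov) flux for the two states at an interface.

    The central average of the physical fluxes is stabilised by a jump term
    scaled with the largest local wave speed |f'(u)| = |u|.
    """
    max_speed = max(abs(u_left), abs(u_right))
    central = 0.5 * (burgers_flux(u_left) + burgers_flux(u_right))
    return central - 0.5 * max_speed * (u_right - u_left)


if __name__ == "__main__":
    # Continuous state: the numerical flux reduces to the physical flux.
    print(rusanov_flux(1.0, 1.0))
    # Jump between the two states: the dissipative term modifies the central flux.
    print(rusanov_flux(1.0, 0.0))
```

For equal left and right states the jump term vanishes and the numerical flux is
consistent with the physical flux, which is the basic requirement on any such solver.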


2.1 High Order Discontinuous Galerkin Solver HALO 

A nodal discontinuous Galerkin scheme on a modal basis is implemented in the 
code HALO (Highly Adaptive Local Operator). The code runs on unstructured 
meshes composed of hexahedra, prisms, pyramids and tetrahedra. To maintain the 
high order accuracy at curved wall boundaries, a high order representation of the 
element boundaries is required. Several techniques for the construction of curved 
element boundaries are used, see [2,9]. The code is designed for the computation of 
unsteady flow problems and fully parallelized with MPI [2]. The scheme is explicit
and therefore each grid cell only needs direct neighbor information. This property
allows a very efficient parallelization. The computation domain is decomposed
by either ParMetis or recently also by the use of space filling curves. A major
disadvantage of an explicit DG scheme can be the global time step restriction for
guaranteed temporal stability. This restriction depends on the grid cell size, on the
degree of the polynomial approximation and on wave speeds for advection terms and
on diffusion coefficients for diffusion terms. In HALO, this drawback is overcome
by a special time discretization, the so-called time-consistent local time stepping
[3,4]. The stability criterion is applied only locally to each grid cell, thus each cell
runs with its optimal time step. Hence, the computational effort is concentrated on 
the grid cells with small time steps. On meshes with strongly varying grid cells as 
well as flow velocities, the number of operations is greatly reduced compared to an 
explicit global time stepping approach. 
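
The saving can be estimated with a short calculation. The sketch below assumes a
commonly used explicit DG time step estimate for advection, dt = CFL * h / ((2N + 1) * a);
this formula, the CFL value, the mesh and the function names are illustrative
assumptions and not the stability criterion actually implemented in HALO. It counts
cell updates for a global time step (every cell advances with the smallest dt) and for
an idealised local time stepping (every cell advances with its own dt).

```python
import math


def local_time_step(h, degree, wave_speed, cfl=0.5):
    """Heuristic advective time step limit of one cell (assumed formula)."""
    return cfl * h / ((2 * degree + 1) * wave_speed)


def count_cell_updates(cell_sizes, degree, wave_speed, t_end):
    """Return (global, local) numbers of cell updates needed to reach t_end."""
    steps = [local_time_step(h, degree, wave_speed) for h in cell_sizes]
    dt_min = min(steps)
    # Global time stepping: all cells are advanced with the smallest time step.
    global_updates = len(cell_sizes) * math.ceil(t_end / dt_min)
    # Local time stepping: each cell is advanced with its own time step.
    local_updates = sum(math.ceil(t_end / dt) for dt in steps)
    return global_updates, local_updates


if __name__ == "__main__":
    # Hypothetical mesh: 10 small cells embedded in 990 cells that are 16x larger.
    sizes = [0.01] * 10 + [0.16] * 990
    n_global, n_local = count_cell_updates(sizes, degree=3, wave_speed=1.0, t_end=1.0)
    print(n_global, n_local, n_global / n_local)
```

The count ignores the overhead of synchronising neighbouring cells, so it gives an
upper bound on the achievable saving rather than a prediction.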


2.2 High Order DGSEM Solver STRUKTI 

A very efficient variant of a discontinuous Galerkin formulation is the discontinuous
Galerkin spectral element method (DGSEM). This special variation of the DG
method is based on a nodal tensor-product basis with collocated integration and
interpolation points on hexahedral elements, allowing for very efficient
dimension-by-dimension element-wise operations.

An easy-to-use structured code (STRUKTI) was set up to test the performance 
of this method, especially for large scale calculations. 


2.3 High Order DGSEM Solver FLEXI 

To enable the efficient simulation of complex geometries, a second DGSEM based 
solver was developed. Sharing the same numerical discretization as STRUKTI, 
FLEXI is tailored to handle unstructured and even non-conforming hexahedral
meshes. A common base tool for grid pre-processing shared by the hybrid unstructured
solver HALO and by FLEXI was developed. This program allows us to
process grid files from different commercial grid generators and translate them into
readable HALO/FLEXI meshes. Furthermore, a module for curved grid generation
and for non-conforming grid connection is included in this tool. As FLEXI shares
the same efficient discretization as STRUKTI, the performance of both codes is
comparable, since in a high order method the effort of managing the grid is negligible.
The difference lies in the parallelization of both codes. FLEXI uses domain partitioning
based on space filling curves, whereas STRUKTI is optimized for structured
meshes. Benchmarking of STRUKTI and FLEXI and improving FLEXI's
parallelization are important ongoing tasks in the group.


2.4 High Order Discontinuous Galerkin Acoustic Solver
NoisSol

For the simulation of flow induced acoustic phenomena in complex domains a 
high order discontinuous Galerkin based solver is very well suited. It combines the 
use of unstructured grids, a low sensitivity to grid quality and low dispersion and 
dissipation errors. 

NoisSol is a solver for the linearized acoustic equations (Linearized Euler 
Equations and Acoustic Perturbation Equations [17]). It applies a discontinuous 
Galerkin scheme on triangular or tetrahedral grid cells. The time discretization
employs either an ADER scheme (Arbitrary High Order Scheme using Derivatives)
or a Taylor-DG scheme [6]. These schemes offer an arbitrarily high order of
convergence in space and time. NoisSol includes curved elements and allows MPI
based parallel computations to reduce the overall wall-clock computation time. 
A further feature is the coupling to the finite-difference solver PIANO [18] for a 
domain decomposition between near field and far field. 


3 Implicit Large Eddy Simulation of the Taylor Green Vortex 

The Taylor-Green Vortex is one of the classical canonical cases for the numerical
investigation of turbulence dynamics. Its simple initialization and boundary
conditions, combined with complex non-linear behavior, make it an ideal candidate
for the evaluation of numerical schemes and the design and development of Large
Eddy Simulation (LES) subgrid scale models (SGS). An open question regarding these
coarse structure simulations with model closures for the unresolved quantities is
whether high order schemes retain their superior accuracy per degree of freedom
(which is well established for well-resolved problems) in underresolved simulations.
In addition to these issues regarding numerical accuracy, the contest between low
and high order schemes in terms of scalability and computational efficiency is still
ongoing.

To find answers to some aspects of these problems we have conducted a series of 
numerical studies of the Taylor-Green Vortex flow at Reynolds number Re = 1,600 
on both the HLRS Nehalem and Hermit clusters with our DGSEM solver “Strukti”. 
We used the Nehalem system to conduct fully resolved benchmark simulations of 
the flow (Direct Numerical Simulations) with at most 134 million spatial degrees of 
freedom. These simulations served as a reference solution for further studies. 

To compare high and low order schemes in a LES-type setting, we selected
a fixed resolution of 256 degrees of freedom per spatial direction, which
is theoretically sufficient for a good resolution of the large and medium scales,
but not of the dissipation-dominated small scales. Our code framework allows an
arbitrary choice of the number of elements (E) as well as of the degree of polynomial
approximation (N) within each cell, so 256 degrees of freedom can be achieved by
a number of different combinations of E and N. Table 1 lists selected computations
with their choices for E and N as well as number of cores and total CPU hours.
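
The relation between E, N and the resolution is simple arithmetic. Reading E as the
number of elements per spatial direction, which is consistent with the values in
Table 1, each direction carries E(N + 1) degrees of freedom. The short Python sketch
below (function names are ours, chosen for illustration) lists the combinations used
in Table 1 and confirms that 512 degrees of freedom per direction correspond to about
134 million degrees of freedom in three dimensions.

```python
def dof_per_direction(elements, degree):
    """Degrees of freedom per spatial direction for a tensor-product DG element."""
    return elements * (degree + 1)


def total_dof(elements, degree, dim=3):
    """Total spatial degrees of freedom on a cube with `elements` cells per direction."""
    return dof_per_direction(elements, degree) ** dim


# (E, N) pairs taken from Table 1
TABLE_1 = [(128, 1), (64, 3), (32, 7), (25, 9), (21, 11), (16, 15), (64, 7), (24, 15)]

if __name__ == "__main__":
    for e, n in TABLE_1:
        print(e, n, dof_per_direction(e, n), total_dof(e, n))
```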

Table 1 Selected Taylor-Green vortex (Re = 1,600) computations

No. of elements E   N    DOF per dir   Stabilization   No. of cores        CPU hours
128                 1    256           -               128 (Cray XE6)      3,500
64                  3    256           -               256 (Cray XE6)      3,840
32                  7    256           -               256 (Cray XE6)      3,900
25                  9    250           -               125 (Cray XE6)      3,960
21                  11   252           -               343 (Cray XE6)      7,000
16                  15   256           Weak            256 (Cray XE6)      9,700
64                  7    512           -               512 (NEC Nehalem)   80,400
24                  15   384           -               512 (NEC Nehalem)   29,600



Fig. 1 Kinetic energy dissipation rate and zoom in on maximum region 


From Table 1, it might seem that low order approximations (low N) clearly
outperform their high order counterparts in terms of computational speed. However,
especially for simulations of turbulent flows, the quality of the approximation is
crucial. Figure 1 compares the results for the computations listed in Table 1 for
the dissipation rate, which is the key quantity in the Taylor-Green vortex
simulation. As these plots show, the high order formulations are clearly more
accurate than the lower order ones, in particular in the fully turbulent, i.e.
numerically most demanding, flow regime. Thus, although low N computations
are at first glance faster for the same nominal resolution, their deficit in accuracy
makes the use of high order schemes very attractive.

This claim is strongly supported by Fig. 2, which compares the computing times
for a Re = 800 case of a second and a 16th order scheme with the exact (DNS)
solution. For the same number of degrees of freedom (64^3), the low order scheme
is significantly faster than the high order one, but while the high order result is
very close to the reference solution, the N = 1 result is too far off
to capture the nature of the solution. Increasing the resolution of the low order
formulation shows a convergence towards the DNS results, but cannot compete
with the efficiency of the high order formulation. All computations were done on
the Nehalem cluster with 64 nodes.


Fig. 2 Comparison of computing time for O(2) and O(16) for the Taylor-Green Vortex
(Re = 800)



Fig. 3 Visualization of vortex detection criterion λ2 = −1.5 for N = 1, N = 3, and N = 15
case (left to right, 256^3 DOF in each case)

A visual impression of our findings is presented in Fig. 3, where we compare 
the vortical structure of the solution for increasing order (and fixed total number of 
degrees of freedom): The significant improvement in solution quality and capturing 
of the important physical structures for higher order shown in Fig. 1 is immediately 
obvious. The full investigation with all results can be found in [27]. 


4 Implicit Large Eddy Simulation of a Compressible 
Roundjet 


We consider the compressible turbulent flow of a roundjet. The considered subsonic
test case is proposed by Bogey et al. [24] with Mach number Ma = 0.9 and
Reynolds number Re = 65,000. The setup of the example is plotted in Fig. 4. The
mesh consists of 69,445 hexahedral grid cells for the domain (30 x 20 x 20) r0, where
r0 is the radius of the roundjet at the inflow.


Fig. 4 Setup of the roundjet example. A hexahedra based mesh is used with domain (30 x 20 x
20) r0 and analytical initial condition



Fig. 5 Instantaneous vorticity distribution in z = 0 plane

For the computation with FLEXI, polynomials with degree 5 are chosen, resulting
in about 1.5 x 10^7 degrees of freedom per conservative variable. The initialization
of the flow field is analytic with a 5 % random disturbance, complemented by
a corresponding random forcing at the inflow during the simulation to drive the
transition to turbulence. As the spatial resolution is not sufficient to resolve all
scales, instabilities can occur due to aliasing of the high order methodology. Similar
to the work of Bogey et al., stabilization via filtering is used, where the highest
polynomial mode is filtered by a factor of about 95 %, resulting in an implicit Large
Eddy Simulation (iLES) type approach for this computation.

Figure 5 shows the instantaneous distribution of the z -vorticity component and 
Fig. 6 compares the characteristic mean centerline velocity with the iLES results of 
Bogey et al. 

The history of the simulation is listed in Table 2 and the overall computational
effort for simulating t = 842 s sums up to 13,346 core-h. Looking at the core-h
required for 1 s simulation time, the strong scaling from 128 to 736 cores is nearly
perfect. For the simulation on 128 cores, the high load per core seems to slightly
slow down the computation, due to caching effects.


Fig. 6 Comparison of mean centerline velocity with the iLES results of Bogey
et al. [24]


Table 2 History of the iLES jet simulation (FLEXI N = 5, 15 x 10^6 DOF) on the HLRS CRAY
XE6 system

#cores   DOF per core   Start - end sim.time (s)   Wallclock time (h)   Core-h   Core-h per (s) sim.time
128      118,068        0-62                       8.00                 1,024    16.36
256      59,034         62-191                     7.94                 2,033    15.76
512      29,517         191-322                    3.98                 2,038    15.56
736      20,534         322-842                    11.21                8,251    15.87
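
The entries of Table 2 can be cross-checked with a few lines of Python: the core-hours
of each run are the product of core count and wallclock time, and their sum gives the
total effort. The variable and function names below are ours; the numbers are copied
from Table 2.

```python
# (cores, wallclock hours, core-h as listed) for the four runs in Table 2
RUNS = [
    (128, 8.00, 1024),
    (256, 7.94, 2033),
    (512, 3.98, 2038),
    (736, 11.21, 8251),
]


def core_hours(cores, wallclock_h):
    """Core-hours consumed by one run."""
    return cores * wallclock_h


if __name__ == "__main__":
    for cores, wall, listed in RUNS:
        print(cores, round(core_hours(cores, wall)), listed)
    print("total core-h:", sum(listed for _, _, listed in RUNS))
```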


5 Direct Numerical Simulation of a Weak Turbulent Flow 
Past a Sphere 

We consider a weakly turbulent flow past a sphere at Mach number Ma_∞ = 0.3
and a Reynolds number with respect to the sphere diameter D of Re_D = 1,000.
This computation was performed with the unstructured DGSEM code FLEXI.
The domain extends 25D downstream and 4.5D upstream and circumferentially.
The unstructured mesh shown in Fig. 7 consists of only 21,128 hexahedra, where
hexahedra lying on the sphere surface are curved. The computation was performed
with polynomial degree 4, yielding 2.64 million degrees of freedom per conservative
variable. The computation was done on the CRAY-XE6 cluster on 4,096 cores. The
computational effort for the characteristic time unit T* = D/u_∞ was 100 core-h
and the simulation was run for 300 time units, thus resulting in 30,000 CPU-h or about
7 h wallclock time. The mean timestep for this computation was Δt = 1.53 x 10^-4 T*,
resulting in a total of about 2 x 10^6 time steps for this computation.


Fig. 7 Unstructured mesh of the sphere, 3D views, front view (left), slice and back view



Fig. 8 Isosurfaces of λ2 = −0.001 of the sphere flow at Re = 1,000
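
The figures quoted for this computation follow from simple arithmetic, which the
following Python sketch reproduces. It assumes (N + 1)^3 degrees of freedom per
hexahedron, as is standard for a tensor-product basis of degree N; all variable names
are illustrative.

```python
ELEMENTS = 21_128            # hexahedra in the mesh
DEGREE = 4                   # polynomial degree N
CORES = 4_096
CORE_H_PER_TIME_UNIT = 100   # core-h per characteristic time unit T*
TIME_UNITS = 300             # simulated time in units of T*
MEAN_DT = 1.53e-4            # mean time step in units of T*

dof = ELEMENTS * (DEGREE + 1) ** 3
total_core_h = CORE_H_PER_TIME_UNIT * TIME_UNITS
wallclock_h = total_core_h / CORES
time_steps = TIME_UNITS / MEAN_DT

if __name__ == "__main__":
    print(f"DOF per variable : {dof:,}")
    print(f"total core-h     : {total_core_h:,}")
    print(f"wallclock hours  : {wallclock_h:.1f}")
    print(f"time steps       : {time_steps:.2e}")
```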

The mean drag is C_d = 0.48 and the Strouhal number St = 0.32 compares well
to values reported by Tomboulides and Orszag [23] and the references therein. At this
Reynolds number, small scales appear in the wake of the sphere due to the
Kelvin-Helmholtz-like instabilities of the shear layer. A visualization of the λ2
criterion in Fig. 8 shows that the behavior of the flow is well captured. As shown in
Fig. 9, a single layer of curved hexahedra is sufficient to resolve the boundary layer.

The history of the simulation is listed in Table 3 and the overall computational
effort for simulating t = 286 s sums up to ≈25,000 core-h. Looking at the core-h
needed for 1 s simulation time, the strong scaling from 352 to 2,048 cores is at 77 %
and for 4,096 cores is at 61 %. The simulation with 4,096 cores runs with only 5
elements per core.


Fig. 9 Velocity magnitude contours (levels [0, 1] u_∞) in the x-y plane


Table 3 History of the sphere simulation (FLEXI N = 4, 2.64 x 10^6 DOF) on the HLRS CRAY
XE6 system

#cores   DOF per core   Start - end sim.time (s)   Wallclock time (h)   Core-h   Core-h per (s) sim.time
352      7,503          62-64                      0.32                 113      56.46
448      5,895          64-72                      0.97                 435      54.34
1,024    2,579          72-81                      0.50                 515      57.17
2,048    1,290          81-108                     0.97                 1,988    73.62
4,096    645            110-286                    4.00                 16,395   92.94

6 Slat Noise 

One topic that is of great interest for computational aeroacoustic applications in 
aerospace sciences is the noise generation of an airfoil in high-lift configuration, 
that is, with deployed slat and flap. This application is also a demanding test case 
for acoustic simulation programs, since it combines a very inhomogeneous flow, a 
complex geometry and many different noise generation mechanisms. 

In this application a three-part airfoil is examined, which was described by
Lockard and Choudhari in 2009 [26]. The calculation presented here is based on
a RANS computation for an unswept wing with an angle of attack of 4°, a Mach
number of 0.17 and a Reynolds number of 1.7e6. Based on this flow field, sound
sources in 2D were calculated by Roland Ewert of the IAS at the German Aerospace
Center (DLR) applying their Fast Random Particle Mesh (FRPM) method [19].
The source calculation was limited to a rectangular region around the slat trailing
edge (see Fig. 10). The mean flow values have also been taken from the RANS
calculation. The acoustic simulation was performed with the Acoustic Perturbation
Equations (APE), type 4 [17]. The space and time order of the scheme were set to
4, the time step was 3.45e-5. A circular domain around the origin with a radius of
3.5 was used. It consisted of 74,700 triangular elements with 747,000 degrees of
freedom.


Fig. 10 Setup NASA 30P30N (arrow points to center of microphone circle)



Fig. 11 NASA 30P30N, pressure field, t = 2.0

Figure 11 shows the pressure field after 2 time units. It includes all the expected
phenomena. For a quantitative evaluation of the results, frequency spectra were
calculated. The spectrum for a microphone point located on a circle with r = 1.5 at
φ = −70° is shown in Fig. 12. It shows good qualitative agreement between the
frequency spectra of the presented calculation and the reference solution by Dierke
et al. [25].

The computation has been performed on the HLRS Nehalem cluster on
320 cores. For the simulation of 9 time units 5.75 h wallclock time or 1,840 core-h
were necessary, which results in 27.3e-5 core-h per degree of freedom and
simulated time unit.
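
This cost metric can be reproduced directly from the numbers given in the text; the
short Python sketch below does so (variable names are illustrative).

```python
CORES = 320
WALLCLOCK_H = 5.75
DEGREES_OF_FREEDOM = 747_000
TIME_UNITS = 9

# Core-hours of the run and cost normalised by problem size and simulated time
core_h = CORES * WALLCLOCK_H
cost = core_h / (DEGREES_OF_FREEDOM * TIME_UNITS)

if __name__ == "__main__":
    print(f"core-h: {core_h:.0f}")
    print(f"core-h per DOF and time unit: {cost:.2e}")
```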

Fig. 12 NASA 30P30N, sound pressure level, NoisSol and Dierke [25] 


7 Summary and Outlook 

In this project, we have successfully applied our high performance discontinuous
Galerkin based framework to several test problems to enhance the physical and
numerical modeling capabilities. Furthermore, the high performance computing
capabilities have been investigated and compared for the different available codes.
Compressible benchmark flows such as the turbulent flow past a sphere as well as
large eddy simulations of isotropic homogeneous turbulence and a jet flow have been
computed and evaluated with typical runs on > 1,000 processors. Our typical reliable
‘production’ runs use O(1,000) processors. In the future, we plan to extend and
improve our framework to support reliable simulation runs on O(10,000) and even
on O(100,000) processors to fully unleash the available (and projected) processing
power of the HLRS CRAY-XE6 cluster.


Computational Aeroacoustics with Higher
Order Methods


E. Rebecca Busch, Michael S. Wurst, Manuel Keßler, and Ewald Krämer


Abstract The Lighthill acoustic analogy in combination with two different
higher-order CFD solvers is used to investigate the sound generation of two test cases.
The flow around a cylinder at Re = 150 is analysed with a Discontinuous Galerkin
method and a counter-rotating open rotor (CROR) with a WENO scheme. The
simulation of the cylinder is able to predict both aerodynamic and acoustic
behaviour correctly; the vortex street behind the cylinder is responsible for noise
radiation similar to that of an acoustic dipole. The analysis of the CROR focuses on the
effect of using a higher-order method with a detailed comparison with a standard
second order method. While global aerodynamic forces show only small differences,
the better transport of vortices, especially of the blade tip vortex, is a benefit for
the prediction of interaction noise of the two rotors. This paper includes different
investigations on the new HLRS Cray Hermit cluster. The DG code was optimized
for single-core usage while still maintaining its good parallel performance. The
effect of node-pinning is studied with the CROR configuration, where it improved the
computational time slightly.


1 Introduction 

For recent research in aerospace engineering, computational fluid dynamics (CFD)
in combination with computational aeroacoustics (CAA) has developed into a useful
tool to examine flow fields and acoustic emission caused by flow phenomena, in
addition to experiments. With the continuous growth of supercomputing power it
is now not only possible to simulate setups with several million cells in acceptable
time, but the use of higher order methods has also become feasible. At the Institute of
Aerodynamics and Gas Dynamics (IAG) of the University of Stuttgart the development
of higher order codes used for CFD simulation of helicopters has been one of the
main research fields. Current CFD simulations often include acoustic evaluation
since noise emission of aircraft plays an increasingly important role due to its
high annoyance to the community. Aeroacoustics are a part of the current research
at IAG. In this paper two test cases were simulated with different numerical methods
and then examined acoustically. A cylinder at a Reynolds number Re = 150
has been simulated with the higher order Discontinuous Galerkin (DG) code
SUNWinT developed by IAG. The SUNWinT code is still being enhanced and
will be used for CFD simulation of helicopters and other applications. Subsequently
an acoustic analysis has been carried out with the IAG tool ACCO, which uses the
Ffowcs Williams-Hawkings (FW-H) equation for acoustic modelling. As a practical
application, a counter-rotating open rotor at take-off conditions has been simulated
with a state of the art finite volume code, namely standard FLOWer, developed by
DLR, and WENO FLOWer, enhanced by IAG. The acoustic evaluation has been
carried out with ACCO as well. Both FLOWer and ACCO are also used for CFD
and CAA simulation of helicopters at IAG. After preceding performance studies
all CFD simulations have been carried out on the new Hermit cluster at the High
Performance Computing Center in Stuttgart (HLRS).


2 Numerical Methods 

2.1 Discontinuous Galerkin Code SUNWinT 

2.1.1 Basic Features 

The Discontinuous Galerkin (DG) method as it is used in fluid mechanics combines 
ideas from finite volume (FV) discretisation techniques as well as from finite 
element (FE) discretisation techniques. A typical FE feature of the DG method 
is that the representation of the solution in a cell is given by a polynomial 
approximation, whose accuracy can easily be improved by increasing its polynomial 
order. However, the solution between cells is discontinuous, thus the solution of this
Riemann problem requires approximate Riemann solvers known from FV methods.
The method was originally developed for the solution of hyperbolic conservation
laws such as the Euler equations [13], which contain only first-order derivatives. It was
extended to the Navier-Stokes equations by Bassi and Rebay [12] and is now
capable of solving equations containing second-order derivatives.


2.1.2 Numerical Formulation


The Navier-Stokes equations are given by

\[ \frac{\partial U}{\partial t} + \nabla \cdot F_i(U) = \nabla \cdot F_v(U, \nabla U) . \tag{1} \]

Herein U is the vector of conservative variables, F_i and F_v are the convective and
the diffusive fluxes. In order to eliminate the second order derivatives of U, the
equation is transformed into a first-order system by introducing the gradient of U as
an additional solution variable \Theta. The new system reads

\[ \Theta - \nabla U = 0 , \tag{2} \]

\[ \frac{\partial U}{\partial t} + \nabla \cdot F_i(U) = \nabla \cdot F_v(U, \Theta) . \tag{3} \]

Discretising both equations with the DG approach results in



where U_h and \Theta_h are L2-integrable, piecewise polynomial functions. We choose
a hierarchical and orthogonal basis according to Sherwin and Karniadakis [19]. 
Both explicit as well as implicit time integration schemes are available within our 
code [18]. 


2.2 The FLOWer Code 


FLOWer is a finite-volume code originally developed by DLR and enhanced by 
IAG. It solves the three-dimensional Reynolds averaged Navier-Stokes equations in 
a block-structured domain for numerous time steps [1,2]. For spatial discretisation 
a second order cell-centered finite volume formulation is used. A hybrid multi-stage 
Runge-Kutta scheme developed by Jameson is applied [3,4] for time integration. 
Additionally, a WENO scheme has been recently implemented into FLOWer by 
IAG. Movement and complex geometry structures such as rotors or nacelles 
are embedded with the chimera method. A component grid is integrated into a 
background mesh where a hole is cut to make room for the component grid cells. 
In the overlapping zones between the grids the data exchange is carried out. For 
moving structures like rotors this process is done anew for every time step. With 
this method complex structures can be meshed with structured grids and a local 
refinement is possible while keeping the same background mesh. 


2.2.1 The Finite Volume Method

Generally, a finite volume scheme is applied since it has the advantage of adaptability
to arbitrary meshes without transforming into a computational domain when a
good quality grid is used. The flow field is split up into hexahedral cells (control
volumes) where the Navier-Stokes equations are applied in integral form on each
cell, ensuring conservation [5]. For each cell a discrete flux balance is
calculated and the change of flow quantities in time at particular points can be
defined. The scheme is given by the equation



where u is the vector of conservative variables in this case, n the normal vector and
f(u) represents the numerical fluxes for a particular control volume arrangement with
the volume C_{i,j,k} surrounding the grid node with the indices i, j, k and the cell face
∂C_{i,j,k}. The fluxes are approximated using either a central or an upwind discretisation
operator.


2.2.2 The Jameson Scheme 

The Jameson scheme utilises a second order approximation and stabilises it with 
high-order Runge-Kutta time stepping and adding an artificial dissipative term at 
the end of each time step [3, 4]. It is the standard method used in FLOWer. The 
conservative variables are simply averaged at the cell faces by 

\[ u_{i+1/2} = \tfrac{1}{2} \left( u_i + u_{i+1} \right) \]


2.2.3 The WENO Scheme 

WENO (“weighted essentially non-oscillatory”) is a further development of the
ENO (“essentially non-oscillatory”) scheme introduced by Harten et al. [6]. The
basic idea is reconstructing or combining lower order fluxes for a higher order
approximation. In the ENO scheme the least oscillatory stencil is then picked
for reconstruction whereas for WENO all stencil candidates are weighted by their
smoothness and subsequently used for reconstruction [7]. The recombination of
stencils is done by

\[ u_{i+1/2} = \sum_{r=0}^{k-1} \omega_r \, u^{(r)}_{i+1/2} \]

where \omega_r are the weights so that \sum_r \omega_r = 1, u^{(r)}_{i+1/2} is the
reconstruction on stencil r, and k the order of stencils.

WENO shows a significant performance increase compared to ENO as the selection
constraints are omitted. Three stencils are implemented at an order of 3, leading to a
fifth order reconstruction. However this is only valid on Cartesian grids with equal cell
distances in all directions.
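
As an illustration of the weighting, the sketch below implements the classical
fifth-order WENO reconstruction of Jiang and Shu for a left-biased interface value on
a uniform one-dimensional grid, using three third-order stencils. It is a generic
textbook formulation written in plain Python; the function name and the small
parameter `eps` are our choices, and the implementation in FLOWer may differ in detail.

```python
def weno5_left(v, eps=1e-6):
    """Reconstruct u at x_{i+1/2} from five cell averages v = [v_{i-2}, ..., v_{i+2}]."""
    vm2, vm1, v0, vp1, vp2 = v

    # Third-order candidate reconstructions on the three stencils
    p0 = (2.0 * vm2 - 7.0 * vm1 + 11.0 * v0) / 6.0
    p1 = (-vm1 + 5.0 * v0 + 2.0 * vp1) / 6.0
    p2 = (2.0 * v0 + 5.0 * vp1 - vp2) / 6.0

    # Smoothness indicators: large where a stencil crosses a steep gradient
    b0 = 13.0 / 12.0 * (vm2 - 2.0 * vm1 + v0) ** 2 + 0.25 * (vm2 - 4.0 * vm1 + 3.0 * v0) ** 2
    b1 = 13.0 / 12.0 * (vm1 - 2.0 * v0 + vp1) ** 2 + 0.25 * (vm1 - vp1) ** 2
    b2 = 13.0 / 12.0 * (v0 - 2.0 * vp1 + vp2) ** 2 + 0.25 * (3.0 * v0 - 4.0 * vp1 + vp2) ** 2

    # Nonlinear weights built from the linear (optimal) weights 1/10, 6/10, 3/10
    a0 = 0.1 / (eps + b0) ** 2
    a1 = 0.6 / (eps + b1) ** 2
    a2 = 0.3 / (eps + b2) ** 2
    total = a0 + a1 + a2
    weights = (a0 / total, a1 / total, a2 / total)

    value = weights[0] * p0 + weights[1] * p1 + weights[2] * p2
    return value, weights


if __name__ == "__main__":
    print(weno5_left([0.0, 1.0, 2.0, 3.0, 4.0]))   # smooth (linear) data
    print(weno5_left([0.0, 0.0, 0.0, 1.0, 1.0]))   # jump between cells i and i+1
```

For smooth data the weights approach the linear weights and the three stencils combine
to fifth order; near a jump the weight of every stencil that contains the jump becomes
very small, which is what suppresses oscillations.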


2.3 Acoustic Post-processing 


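Acoustic analysis is done with the IAG tool ACCO which uses the FW-H equation:

\[
\left( \frac{\partial^2}{\partial t^2} - c^2 \frac{\partial^2}{\partial x_i \partial x_i} \right) \left[ \rho' H(f_s) \right]
= \frac{\partial^2}{\partial x_i \partial x_j} \left[ T_{ij} H(f_s) \right]
- \frac{\partial}{\partial x_i} \left[ \left( p' n_i + \rho u_i (u_n - v_n) \right) \delta(f_s) \right]
+ \frac{\partial}{\partial t} \left[ \left( \rho_0 v_n + \rho (u_n - v_n) \right) \delta(f_s) \right]
\]

where \rho' denotes the density fluctuation, c the speed of sound, p' the pressure
fluctuation, n_i the normal vector, u_n the normal component of the fluid velocity, v_n the
normal component of the surface velocity and \delta(f_s) the Dirac delta function [8,9]. The wave equation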
on the left hand side describes the propagation of sound in space while the 
right hand side specifies the monopole and dipole source terms on surfaces and 
quadrupole sources terms in volumes. The FW-H equation allows the integration 
surfaces to surround the noise generating flow structures and geometries. This saves 
the integration of volume sources inside in addition to the surface terms on the 
geometric surfaces. However, this requires the choice of the surrounding surfaces 
far enough from geometric surfaces, so that all source terms are included, but 
close enough so that numerical dissipation has not significantly damped pressure 
fluctuations. Hence the use of high order methods with a good vortex transport is 
worthwhile here. For acoustic analysis surfaces extracted from the flow field are 
exported for every time step and then analysed by ACCO. ACCO returns total sound 
pressure levels as well as pressure fluctuation over time. Additionally, a Fast-Fourier 
transformation can be carried out. 
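
How a sound pressure level follows from a pressure-fluctuation time series can be
illustrated with a short sketch. It is not ACCO's implementation: the function names and
the synthetic 1 Pa tone are assumptions for illustration, and only the reference pressure
of 20 µPa is the usual convention for sound in air.

```python
import math

P_REF = 20e-6  # reference pressure in Pa (usual convention for sound in air)


def overall_spl(p_fluct):
    """Overall sound pressure level in dB from a pressure-fluctuation series in Pa."""
    mean = sum(p_fluct) / len(p_fluct)
    rms = math.sqrt(sum((p - mean) ** 2 for p in p_fluct) / len(p_fluct))
    return 20.0 * math.log10(rms / P_REF)


def tone_amplitude(p_fluct, dt, freq):
    """Amplitude of one frequency component (single-bin discrete Fourier sum)."""
    n = len(p_fluct)
    re = sum(p * math.cos(2.0 * math.pi * freq * k * dt) for k, p in enumerate(p_fluct))
    im = sum(p * math.sin(2.0 * math.pi * freq * k * dt) for k, p in enumerate(p_fluct))
    return 2.0 * math.hypot(re, im) / n


# Synthetic example: a 100 Hz tone of 1 Pa amplitude, 360 samples over one period
freq = 100.0
n_samples = 360
dt = 1.0 / (freq * n_samples)
signal = [math.sin(2.0 * math.pi * freq * k * dt) for k in range(n_samples)]

spl = overall_spl(signal)                    # about 91 dB
amplitude = tone_amplitude(signal, dt, freq)  # about 1 Pa
```

For a full spectrum the single-bin sum would be replaced by a fast Fourier transform
over all frequencies.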


3 Computational Aspects 

3.1 Enhancement and Performance Study of the SUNWinT 
Code 

Our code SUNWinT has previously shown excellent parallel speedup, e.g. a parallel
efficiency of up to 85 % using 2,048 cores was reached on both the HLRS NEC Nehalem
Cluster and the HLRB II SGI Altix system [16,17]. Sustaining this parallel
speedup on new and upcoming HPC systems is the aim of current developments in
our DG code. In a first step, the single-core performance of the code was improved in
a major code redesign. The evaluation of the surface and volume integrals in Eq. (5)
is now done in small groups of cells called patches. These patches consist of around 15 cells.
A typical decomposition of the mesh into patches can be seen in Fig. 1.

Fig. 1 Decomposition of a mesh in patches

The idea behind these patches is that their data are stored in a memory-efficient
way, so that they can be transferred quickly from memory to the CPU.
Basically, there are two different kinds of patches in a mesh: inner patches and
border patches. Inner patches lie fully inside the computational domain and
border neither a physical boundary nor a domain handled by another
processor. These patches can all be handled in exactly the same way. Border patches
need special treatment, either to ensure that they fulfil their boundary condition or to
handle the communication between different processors via MPI. Additionally, the
evaluation of the surface and volume integrals inside each patch was simplified in
such a way that all time-independent parts of these integrals are precalculated and
efficiently combined with the time-dependent parts. A comparison of the single-core
performance of the old and new evaluation is shown in Table 1 for a 2D case and in Table 2
for a 3D case. The computational time is reduced significantly
for both the Euler and the Navier-Stokes equations. The saving in
computational time is larger for higher-order calculations, e.g. for fourth-order
calculations the new evaluation is roughly four to six times faster. The 2D and 3D cases
show no significant difference; the reduction in computational time is about the same. In
the future, it is planned to use the patches for a hybrid parallelisation of our code.
The idea is that each node will handle a large number of patches and will assign a
patch to a core for evaluation as soon as that core becomes free.
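
The reduction percentages in Tables 1 and 2 translate directly into speedup factors. The
following sketch recomputes both for the fourth-order Euler runs; the timings are copied
from the tables, while the helper names are chosen here for illustration only.

```python
def reduction_percent(t_old, t_new):
    """Relative saving in computational time, in percent."""
    return 100.0 * (t_old - t_new) / t_old


def speedup(t_old, t_new):
    """Factor by which the new evaluation is faster than the old one."""
    return t_old / t_new


# (old, new) computational times of the fourth-order Euler runs from Tables 1 and 2
euler_2d = (1267.46, 201.07)
euler_3d = (5254.97, 1218.78)

red_2d = reduction_percent(*euler_2d)   # 84.1 %, as listed in Table 1
red_3d = reduction_percent(*euler_3d)   # 76.8 %, as listed in Table 2
fac_2d = speedup(*euler_2d)             # about 6.3
fac_3d = speedup(*euler_3d)             # about 4.3
```

A reduction of about 80 % corresponds to a factor of five, which is the order of magnitude
of the fourth-order results.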

Table 1 Comparison of computational time for old and new evaluation for a 2D case with different
order and physical modelling on a single processor for 1,000 time steps

Phys. model     Evaluation      First order   Second order   Third order   Fourth order
Euler           Old                   10.18           76.1        371.84       1,267.46
                New                    5.63          25.16         76.98         201.07
                Reduction (%)          44.7           66.9          79.3           84.1
Navier-Stokes   Old                   28.02          235.6      1,042.98       2,725.31
                New                   12.23          67.81        227.06         571.53
                Reduction (%)          52.7           72.9          79.5           79.6


Table 2 Comparison of computational time for old and new evaluation for a 3D case with different
order and physical modelling on a single processor for 200 time steps

Phys. model     Evaluation      First order   Second order   Third order   Fourth order
Euler           Old                   15.26         139.95      1,017.45       5,254.97
                New                    7.66          44.05        187.97       1,218.78
                Reduction (%)          49.8           68.5          81.5           76.8
Navier-Stokes   Old                   43.53         422.62      2,727.41      13,355.66
                New                   24.07         147.43        652.71        2,079.4
                Reduction (%)          44.7           65.1          76.1           84.4


The parallel performance on the HLRS Hermit cluster is determined
with two different cases: the 3D simulation from the next section and another
test case. Both meshes are decomposed with METIS [15] into 16 up to 16,384
zones. The first mesh consists of 336,000 cells, leading on average to only 82 cells per
core on its finest decomposition level. The second
contains 930,000 cells, leading to an extremely low count of 57 cells per core on the
finest level. Despite the small number of cells per core, the scaling in Fig. 2a is
very good for both cases: the time per iteration decreases almost linearly up to
4,096 cores with a parallel efficiency of 81 % (cf. Fig. 2b) for the first case and up
to 16,384 cores with a parallel efficiency of 89 % for the second case.

Fig. 2 (a) Strong scaling and (b) efficiency on the HLRS Hermit
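
The cell counts per core and the definition of parallel efficiency used in a strong-scaling
study can be sketched as follows. The mesh sizes and core counts are taken from the
text; the two wall-clock times are hypothetical placeholders, since the measured timings
behind Fig. 2 are not listed.

```python
def cells_per_core(n_cells, n_cores):
    """Average number of cells each core has to process."""
    return n_cells / n_cores


def parallel_efficiency(t_ref, n_ref, t_n, n_cores):
    """Strong-scaling efficiency relative to a reference run on n_ref cores."""
    return (t_ref * n_ref) / (t_n * n_cores)


first_mesh = cells_per_core(336_000, 4_096)     # about 82 cells per core
second_mesh = cells_per_core(930_000, 16_384)   # about 57 cells per core

# Hypothetical timings: 100 s per iteration on 16 cores, 0.48 s on 4,096 cores
efficiency = parallel_efficiency(t_ref=100.0, n_ref=16, t_n=0.48, n_cores=4_096)
```

An efficiency of 1 means that doubling the core count halves the time per iteration; the
hypothetical timings above give about 0.81.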


3.2 Performance Study with a Counter-Rotating Open Rotor 
Setup with FLOWer 

In order to determine the fastest way of computation, a number of performance
tests were run on the Hermit (Cray XE6) cluster at HLRS. The original setup
consists of roughly 50 million cells, of which about 25 million belong
to the background mesh, which contains the cylindrical geometry for a simplified
model of the CROR’s nacelle. The 9 x 7 rotor setup is modelled with one
component grid per blade, each containing 1.5 million cells. The total
setup consists of 816 structured blocks with a maximum size of 75,000 cells per
block. Figure 3a and b show the iterations per second for different
numbers of CPUs used per node. For the Cray compiler all three curves show good
parallel scaling, while with the GNU compiler the efficiency decreases significantly
when all 32 CPUs of a node are used and the total number of CPUs is large. This may be
due to less efficient communication between the CPUs in the GNU-compiled code
compared to the Cray-compiled code. For both compilers node pinning does not show
a clear improvement, except for the use of 816 CPUs on 51 nodes with the GNU
compiler. Only for small numbers of CPUs, such as 192, can a slight improvement be seen
when using every second CPU. Using every fourth CPU has an even smaller effect on
CPU time.

Fig. 3 Performance of FLOWer code with node pinning, (a) Cray compiler and (b) GNU compiler


4 Results 

4.1 Laminar Flow Around a Cylinder 

In order to study the usage of ACCO with our DG code SUNWinT, the flow around
a cylinder is calculated. The flow is investigated at Re = 150 based on the diameter
D of the cylinder, and the freestream Mach number is Ma = 0.1. Both 2D and
3D simulations were performed. The computational domain is 60D large and has
a depth of two diameters in the 3D case. The mesh is of O-grid type and consists
of 120 x 70 (x 40) cells. We use fourth-order accurate P3 elements for the spatial
discretisation, and the temporal discretisation is chosen accordingly with an explicit,
fourth-order, one-step Runge-Kutta type scheme. The results are compared with
the results of Inoue and Hatakeyama [14], who used a sixth-order finite difference
scheme for this problem.

Fig. 4 (a) Time-averaged pressure coefficient c_p for the 2D and 3D simulation, (b) time-
dependent lift and drag coefficient for the 2D simulation


Table 3 Global forces C_D, C_L,a and the Strouhal number St

                          C_D     C_L,a   St
2D                        1.33    0.52    0.187
3D                        1.33    0.52    0.184
Inoue and Hatakeyama      1.32    0.52    0.183

The aerodynamic results of the 2D and 3D simulations are in very good
agreement with the simulation of Inoue and Hatakeyama, as the time-averaged
pressure coefficients in Fig. 4a indicate. As a consequence, the global forces are also
very similar (cf. Table 3). This kind of flow typically develops a vortex street behind
the cylinder, which can be seen in the time-dependent behaviour of the lift and
drag coefficients (cf. Fig. 4b). The Strouhal number of this vortex shedding is
0.187 in the 2D case and 0.184 in the 3D case (Inoue and Hatakeyama: St = 0.183).
A comparison of the amplitudes of both forces in Fig. 4b shows that the amplitude of
the lift coefficient is much larger than that of the drag coefficient.
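
The Strouhal number is the shedding frequency made non-dimensional with the cylinder
diameter and the freestream velocity, St = f D / U. The following sketch extracts it from
a lift-coefficient history by locating the dominant frequency. The signal is synthetic
(a sine with the amplitude 0.52 from Table 3 and an assumed frequency), not data from the
simulations, and the function names are chosen for illustration.

```python
import math


def dominant_frequency(signal, dt):
    """Frequency of the largest Fourier component (direct discrete Fourier sum)."""
    n = len(signal)
    mean = sum(signal) / n
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        re = sum((s - mean) * math.cos(2.0 * math.pi * k * j / n)
                 for j, s in enumerate(signal))
        im = sum((s - mean) * math.sin(2.0 * math.pi * k * j / n)
                 for j, s in enumerate(signal))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_k, best_mag = k, mag
    return best_k / (n * dt)


def strouhal_number(frequency, diameter, velocity):
    """Non-dimensional shedding frequency St = f D / U."""
    return frequency * diameter / velocity


# Synthetic lift history in non-dimensional units (D = 1, U = 1)
dt, n_samples, f_shed = 0.2, 400, 0.1875
lift = [0.52 * math.sin(2.0 * math.pi * f_shed * j * dt) for j in range(n_samples)]

st = strouhal_number(dominant_frequency(lift, dt), diameter=1.0, velocity=1.0)
```

The frequency resolution of such an estimate is 1/(n_samples dt), so the sampled time
span has to cover enough shedding periods.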

For the acoustic analysis the results of the 2D simulations are used. The hull
surfaces for the acoustic solver ACCO are placed at a distance of three cylinder
diameters from the center of the cylinder. The observers for the acoustic
analysis are placed on a circle 125D away from the cylinder. Figure 5a shows
the sound pressure level (SPL) at an observer position located 90° from the freestream
direction for different frequencies. The acoustic behaviour is dominated by the
aerodynamic behaviour, more precisely by the vortices shed from the cylinder,
which lead to the oscillating forces. Peaks in the SPL are present at the Strouhal
frequency and at its higher harmonics. The dipole nature of the generated
sound is seen in the directivity pattern (Fig. 5b) for the fundamental frequency and
is in good agreement with the findings of Inoue and Hatakeyama, who also identified
the lift dipole as the major acoustic source.

Fig. 5 (a) SPL distribution vs. frequency for an observer at 90° from the freestream direction, (b) 
directivity pattern for the fundamental frequency 


4.2 A Counter-Rotating Open Rotor in Take-Off Conditions 

A 9 x 7 CROR configuration was simulated with standard FLOWer settings and
WENO FLOWer at take-off conditions [10]. The environmental conditions were set
to ICAO standard atmosphere conditions at sea level. The simulation was carried
out for eight rotor revolutions for both cases on the Hermit cluster at HLRS. The
total computation time on 384 CPUs was approximately 240 h for standard FLOWer
settings and 270 h for WENO FLOWer, i.e. about 12.5 % more. The WENO
scheme was applied only in a user-defined set of blocks in the background grid
in proximity to the rotors.

4.2.1 Aerodynamic Results 

For comparison of the aerodynamic results, integrated values such as thrust
coefficient, drive torque coefficient and propulsion efficiency are examined. Additionally,
the vortex conservation is shown, as it is especially important for the rotor-rotor
interaction, which contributes considerably to CROR noise.

Figure 6a and b show the thrust coefficient C_T and the torque coefficient C_M
plotted over one rotor revolution. Both the front and the aft rotor show higher values of
C_T and C_M with WENO. This is due to the higher order of the WENO
FLOWer code and is similar to effects seen in previous grid studies, where
higher grid resolution showed slightly higher thrust and drive torque levels. Hence
the grid resolution used in this calculation still shows effects of grid dependency.
Additionally, the standard FLOWer code shows higher fluctuations of C_T and C_M,
especially for the aft rotor, indicating that it cannot represent the highly unsteady
incoming flow as well as WENO FLOWer. The fluctuations of C_T and C_M
for the front rotor are in roughly the same range for the standard and the WENO
version.
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Fig. 6 (a) Ct and (b) Cm over one rotor revolution 



Fig. 7 Vortex visualisation of the flow field calculated with (a) standard FLOWer and (b) WENO 
FLOWer 


The characteristic numbers such as C_T and C_M differ by less than 1 % for the front
and aft rotor as well as for the CROR totals. For the front rotor the same propulsion
efficiency is reached with standard and WENO FLOWer, whereas for the aft rotor
and the total CROR the propulsion efficiency differs by 0.1 %. Overall, both
FLOWer versions show excellent agreement, but the WENO version is generally
better at capturing unsteady flow phenomena.

Comparing Fig. 7a and b, the vortex transport with the WENO version is better
than with the standard version. Especially the vortices that have already passed the
aft rotor plane are conserved better with WENO. Wakes and vortices are also more
clearly defined with WENO.
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4.2.2 Acoustic Results 

The acoustic emission of the CROR was evaluated over one rotor revolution with a
resolution of 360 time steps for both standard and WENO FLOWer. The IAG tool
ACCO was used to obtain sound pressure levels at microphone observers in the
x-y plane, where the x-axis is the rotational axis pointing in flow direction and the
y-axis is perpendicular to it, so that x and y span a plane perpendicular to the rotor planes.
The microphones were placed at angular intervals of Δθ = 5°, with θ = 0° on
the negative x-axis and θ = 180° on the positive x-axis. θ = 90° is located at a
radius of r = 5 m from the rotational axis, equidistant from the front and aft rotor.
Comparing the noise directivities generated from the standard FLOWer and WENO
FLOWer CFD simulations, both show the same characteristics, that
is, a main peak at θ ≈ 95° and minor peaks at lower and higher θ, see Fig. 8a.
These minor peaks are significantly higher for the WENO case, whereas the main
peak reaches the same level for both simulations. The main peak shows a stronger
decay for angles below and above θ ≈ 95° for the standard FLOWer simulation.

Figure 8b, c show the tonal noise at the blade passing frequencies of the front (BPF1)
and aft rotor (BPF2) and at one higher harmonic (2BPF1, 2BPF2), which are the main
contributions to the noise in the radial direction of the rotors. For the BPFs both code
versions show very good agreement, while for the higher harmonic a discrepancy
can be seen, especially for the front rotor. The nine-blade front rotor has a higher BPF
and first harmonic than the seven-blade aft rotor. The ability to resolve frequencies
in the acoustic analysis depends strongly on the temporal and spatial resolution
of the CFD simulation; hence WENO FLOWer can still resolve the first harmonic of
the front rotor, in contrast to standard FLOWer. For the aft rotor this effect cannot
be seen yet, as standard FLOWer is able to resolve the first harmonic, which lies at a
lower frequency.

Figure 8d-f show the rotor-rotor interaction noise, which occurs at frequencies
that are linear combinations of the BPFs, such as BPF1 + BPF2, 2BPF1 + BPF2
and BPF1 + 2BPF2. The lowest interaction frequency BPF1 + BPF2 is mainly
responsible for the minor peaks in the total noise, cf. Fig. 8a. Here the discrepancies
between standard and WENO FLOWer become exceptionally clear. The transport
of the front rotor wake and blade tip vortices is worse with standard FLOWer than
with WENO FLOWer. For the second interaction frequency
(Fig. 8e) this effect can still be observed, while for the third interaction frequency
(Fig. 8f) it is not visible at all. This leads to the conclusion that the resolution
of WENO FLOWer is also not sufficient to obtain noise directivities at these high
frequencies.
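
The tone frequencies discussed above follow from the blade counts and the rotational
speed. A short sketch, assuming for illustration that both rotors turn at the same speed
and using a hypothetical value of 20 revolutions per second (the actual rotor speed is
not given in the text):

```python
def blade_passing_frequency(n_blades, rev_per_second):
    """Blade passing frequency in Hz: blade count times rotational frequency."""
    return n_blades * rev_per_second


def interaction_tones(bpf1, bpf2, max_order=2):
    """Sum tones m*BPF1 + n*BPF2 for all m, n from 1 to max_order."""
    return {(m, n): m * bpf1 + n * bpf2
            for m in range(1, max_order + 1)
            for n in range(1, max_order + 1)}


REV_PER_SECOND = 20.0  # hypothetical rotor speed, identical for both rotors

bpf1 = blade_passing_frequency(9, REV_PER_SECOND)  # nine-blade front rotor
bpf2 = blade_passing_frequency(7, REV_PER_SECOND)  # seven-blade aft rotor
tones = interaction_tones(bpf1, bpf2)

first = tones[(1, 1)]    # BPF1 + BPF2
second = tones[(1, 2)]   # BPF1 + 2 BPF2
third = tones[(2, 1)]    # 2 BPF1 + BPF2
```

Because the front rotor has more blades, BPF1 + 2BPF2 lies below 2BPF1 + BPF2 under this
assumption, which matches the ordering of the second and third interaction frequency in
Fig. 8.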

For the BPFs, their first harmonics and the lower interaction frequencies, good results
can be achieved with both code versions. These are the relevant frequencies, as
they are the main contributors to total noise. Their level is roughly 10 dB higher
than that of second or higher harmonics of the BPFs and of higher interaction frequencies.
Both codes show good agreement for the low frequencies and, as expected, the results
differ at higher frequencies. The main noise characteristics can be examined with
the data obtained by both CFD simulations, but standard FLOWer reaches the
limits of its resolution at the first harmonics of the BPFs and shows weaknesses for the
lower interaction frequencies. Yet the directivities for the lower interaction frequencies
obtained with standard FLOWer still show the same characteristics as with WENO FLOWer.
A good qualitative agreement with the results of measurements by Woodward could
be achieved for both standard and WENO FLOWer [11].

Fig. 8 Noise directivity in x-y-plane. (a) Total sound pressure level, (b) blade passing frequency
of front rotor, (c) blade passing frequency of aft rotor, (d) first interaction frequency
(BPF1 + BPF2), (e) second interaction frequency (BPF1 + 2BPF2), and (f) third interaction
frequency (2BPF1 + BPF2)


5 Conclusion 

Two different test cases were simulated in this paper with two different CFD codes
and analysed acoustically. Both codes are capable of discretising the Navier-Stokes
equations with a higher-order method. The simulation of the flow around a cylinder
at Re = 150 with the DG code SUNWinT was a first step to analyse the potential
of an acoustic analysis with the DG method. The results are in good agreement with those
of Inoue and Hatakeyama [14]. The aerodynamic forces and the distribution of the
time-averaged pressure coefficient show no significant differences. The behaviour of
the cylinder as an acoustic lift dipole was also clearly verified by the peaks of
the SPL in the frequency analysis and by the directivity pattern of the fundamental
frequency. In a next step, the method will be used for the acoustic analysis of
turbulent flows. Concerning the performance of SUNWinT, several improvements
have been made which reduce the computational time while maintaining the good
parallel performance.

A 9 x 7 CROR configuration was simulated with standard and WENO FLOWer
at take-off conditions. The aerodynamic and acoustic results of both code versions
are in good agreement. Only a slight discrepancy can be detected for the integrated
aerodynamic values. Comparison of the flow fields obtained with the different
versions illustrates that the vortex transport is better with WENO FLOWer. However,
with good spatial and temporal resolution an acceptable transport can also be
achieved with standard FLOWer. The acoustic evaluation showed that for the single
rotor noise the two code versions hardly differ. The interaction noise contribution
can be examined much better with WENO FLOWer. This was expected, as WENO
FLOWer showed a better transport of the front rotor wake and vortices, whose
interaction with the aft rotor generates the interaction noise. Generally, the use of a
higher-order method is advisable for acoustic analysis, as the quality of the acoustic
results depends directly on the CFD results with the FW-H method used in this
paper. With WENO FLOWer the additional computational effort is acceptable in view of
the improvement of the results.
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Effects of an Oblique Roughness on Hypersonic 
Boundary-Layer Transition 


Gordon Groskopf and Markus J. Kloker 


Abstract The compressible bi-global linear stability theory (B-LST), based on
two-dimensional eigenfunctions in flow crossplanes, as well as direct numerical
simulation (DNS) are applied to investigate the stability properties of
three-dimensional laminar hypersonic boundary-layer flows with discrete surface
roughness. The obliquely-placed fence-like roughness element has a height of half the
unperturbed thermal boundary-layer thickness at its position on the flat plate. The
roughness setup is derived from a Space Shuttle flight experiment. A cold-flow
non-reacting gas case at wind-tunnel conditions with Mach 4.8 is considered. The
laminar steady base flow is extracted from a (D)NS base-flow solution assuming
perfect-gas behavior. Local and integral growth of instabilities in the wake of
the discrete roughness element are investigated, revealing a considerable persistent
amplification in the flow. The cold-flow results are compared to a recent instability
analysis of a hot-flow, reacting gas case with identical Mach number and roughness
geometry. A full DNS of the cold-flow case validates the B-LST results: The growth
of the excited disturbance modes shows good agreement. Performance data for the
applied codes running on the NEC SX-9 and CRAY XE6 are given.


1 Introduction 

Boundary-layer transition at hypersonic flow speeds is of particular importance to
the design of respective vehicles. On the one hand, the vehicle structure encounters a
massive instantaneous as well as integral heat load. The amount of heat the vehicle
has to withstand is significantly influenced by surface roughness, e.g., the gap filler
protruding from the thermal protection system of the Space Shuttle (Fig. 1). In case
of sustained hypersonic flight, aerodynamic drag is another crucial issue, especially
with regard to fuel consumption. Both heat load and drag are to be minimized by
appropriate measures. On the other hand, there are cases where the flow is desired
to be turbulent to alleviate the effects of flow separation or to enhance mixing.
Therefore it is essential to understand how discrete surface roughness influences
and alters the laminar-turbulent transition mechanisms.

Fig. 1 Ceramic-tile gap filler protruding from the thermal protection system of the
Space Shuttle (Source: NASA (www.nasa.gov))

Progress has been achieved recently in this field. The experimental approach (see,
e.g., [6]) peaked in the STS-119 flight experiment. A numerical approach has been
applied by, e.g., [3, 4, 15, 16]. Theoretical analyses of the disturbance growth in
steady base flows have been done by, e.g., [7-11, 14].

Applying the B-LST in crossplanes in the wake of the roughness element, [7]
and [9] show that roughness geometries with half the height of the undisturbed
boundary layer, whether symmetric or oblique with respect to the oncoming flow,
persistently enhance the growth of first-mode-type instability modes due to the main
trailing vortices and ensuing velocity streaks in the wake of the roughness. Within
the parameter range investigated, the contribution of the horseshoe vortices is of
minor importance. This is in accordance with “Condition I” of [3]. The horseshoe
vortices attain primary importance if the height of the roughness element reaches or
exceeds the undisturbed boundary-layer thickness.

While in [3] the unsteady flow behind a three-dimensional, cylindrical roughness
element with a height of O(δ) has been simulated by means of a Navier-Stokes solver,
[15] investigates the evolution of well-defined disturbances in a flat-plate boundary-
layer flow altered by a two-dimensional roughness element by means of unsteady
DNS, supported by results from primary linear stability theory. It is shown that a
roughness with a size of up to 70 % of the undisturbed boundary-layer thickness
alters the stability properties of the flow only locally. Such a discrete 2-d element
represents a disturbance amplifier with a limited bandwidth: A selected frequency
range gains in amplitude locally due to the roughness.

In this paper we investigate the evolution of disturbances in a flow at wind-tunnel
conditions altered by a three-dimensional, obliquely-placed roughness. The shape
of the roughness element follows the geometry used in [10]. It is derived from the
STS-119 flight experiment, which investigated the influence of an isolated roughness,
mounted on the Space Shuttle’s belly, on laminar-turbulent transition during
re-entry. The results of the stability analysis are compared to the hot-flow case from
[10] and [11], which exhibits the same roughness geometry with respect to the thermal
boundary-layer thickness δ_T at the same position on the flat plate in terms of Re_δT
and Mach number.


2 Governing Equations, Numerics and Flow Setup

2.1 Governing Equations

The three-dimensional unsteady compressible Navier-Stokes equations are used in
a non-dimensional form. Length scales are normalized by a reference length L*
(* marks dimensional quantities). Velocities u, v, and w in streamwise (x), wall-
normal (y), and spanwise (z) direction are normalized by the freestream velocity
u*_∞. Furthermore, the respective freestream values are used as reference for density
ρ, temperature T, thermal conductivity ϑ, and viscosity μ. Non-dimensional
pressure p and time t are based on the reference values ρ*_∞ u*_∞² and L*/u*_∞, respectively.


2.1.1 Direct Numerical Simulation 

The equations are applied in conservative formulation (ρ, ρu, ρv, ρw, E), with E
being the total energy per unit volume. The fluid is assumed to be a calorically
perfect gas. Thus, a variation of heat capacities with temperature as well
as chemical reactions are excluded. The adiabatic exponent and Prandtl number are
fixed to κ = 1.4 and Pr = 0.71, respectively, for air. The viscosity is computed from
Sutherland’s law. For more details see [1] and [2].
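
For orientation, Sutherland's law in its common dimensional form can be written as a
short function. The constants below are the usual textbook values for air and are an
assumption here; the constants and any low-temperature modification used in the DNS code
are not given in the text (see [1] and [2]).

```python
MU_REF = 1.716e-5   # reference viscosity of air in Pa s (textbook value, assumed)
T_REF = 273.15      # reference temperature in K (assumed)
S_AIR = 110.4       # Sutherland temperature of air in K (assumed)


def sutherland_viscosity(temperature):
    """Dynamic viscosity of air in Pa s from Sutherland's law."""
    return (MU_REF * (temperature / T_REF) ** 1.5
            * (T_REF + S_AIR) / (temperature + S_AIR))


mu_300 = sutherland_viscosity(300.0)   # viscosity at 300 K
# Non-dimensional viscosity, referred to a freestream temperature of 300 K
mu_ratio = sutherland_viscosity(600.0) / mu_300
```

In the non-dimensional formulation of the text, the viscosity is divided by its
freestream value, as in the last line.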


2.1.2 Bi-global Linear Stability Theory 

The B-LST equations are based on the Navier-Stokes equations formulated in
primitive variables (ρ, u, v, w, T). All flow quantities are split into a steady base
flow and an unsteady perturbation. The assumptions of the (primary) linear stability
theory hold, complemented by the following specifics for the bi-global approach
in crossplanes: A non-zero wall-normal base-flow velocity is allowed as long as
its spanwise mean is zero, and the complex amplitude of the perturbation’s modal
ansatz is two-dimensional. Analyzing flow crossplanes, the amplitude distribution
extends in wall-normal and spanwise direction:

$$ \phi(x, y, z, t) = \hat{\phi}(y, z)\, e^{i(\alpha x - \omega t)} \qquad (1) $$

In case of the temporal approach, used here for the computation of the eigenmodes,
α is real and ω complex. In the more natural spatial approach α is complex
and ω real. The imaginary part of the complex α = α_r + iα_i, however, can be
deduced with sufficient accuracy from the temporal theory using Gaster’s relation
(see, e.g., [5]):

$$ \alpha_i = -\frac{\omega_i}{c_{gr}} \qquad (2) $$

where c_gr is the group velocity of the perturbation. This relation has been found to
agree excellently with the spatial solution of the eigenvalue problem for the present
flow cases (see [9]). Further details can be found in [7, 9, 10].

Table 1 Flow parameters

Case             Cold flow        Hot flow (from [10])
Re_L             10^5             10^5
Ma_∞             4.8              4.84
Pr_∞             0.71             0.67
κ_∞              1.4              1.36
L* (m)           8.5 · 10^-3      1.06
u*_∞ (m/s)       716.3            6,668
T*_∞ (K)         55.4             4,006
p*_∞ (Pa)        1,000            2,432
R_{x_r}          1,225            3,226
Re_δT            about 20,000     about 20,000 (see Sect. 2.3.2)
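
Gaster's relation, Eq. (2), can be illustrated with a small sketch that estimates the
group velocity from two neighbouring temporal eigenvalues and converts the temporal
growth rate into a spatial one. All numerical values are hypothetical and the function
names are chosen for illustration.

```python
def group_velocity(alpha_a, omega_r_a, alpha_b, omega_r_b):
    """Finite-difference estimate of c_gr = d(omega_r)/d(alpha) from two temporal modes."""
    return (omega_r_b - omega_r_a) / (alpha_b - alpha_a)


def spatial_growth_rate(omega_i, c_gr):
    """Gaster's relation: alpha_i = -omega_i / c_gr (negative alpha_i means growth in x)."""
    return -omega_i / c_gr


# Hypothetical temporal results at two neighbouring streamwise wavenumbers
c_gr = group_velocity(alpha_a=1.00, omega_r_a=0.80, alpha_b=1.10, omega_r_b=0.88)
alpha_i = spatial_growth_rate(omega_i=0.02, c_gr=c_gr)
```

With the ansatz of Eq. (1), a temporally amplified mode (ω_i > 0) travelling downstream
(c_gr > 0) is mapped to a spatially amplified mode with α_i < 0.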


2.2 Numerics 

2.2.1 Direct Numerical Simulation

For a detailed description of the basic algorithm applied to solve the unsteady 
Navier-Stokes equations see [1,2]. Sixth-order compact finite differences (FDs) 
are applied for the spatial discretization in streamwise, wall-normal and spanwise 
direction. The time integration is based on the classical fourth-order four-step 
Runge-Kutta scheme. The equations are solved on a structured curvilinear grid. The 
implemented algorithm runs on various computer architectures using a combination 
of distributed- and shared-memory parallelization. 
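
One step of the classical fourth-order four-stage Runge-Kutta scheme named above has the
following generic form. This is a textbook sketch for a system of ordinary differential
equations, not an excerpt from the DNS code; the decay problem used to exercise it is an
assumption for illustration.

```python
import math


def rk4_step(rhs, t, y, dt):
    """Advance the state list y from t to t + dt with the classical RK4 scheme."""
    k1 = rhs(t, y)
    k2 = rhs(t + 0.5 * dt, [yi + 0.5 * dt * ki for yi, ki in zip(y, k1)])
    k3 = rhs(t + 0.5 * dt, [yi + 0.5 * dt * ki for yi, ki in zip(y, k2)])
    k4 = rhs(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def decay(t, y):
    """Right-hand side of the test problem dy/dt = -y."""
    return [-yi for yi in y]


# Integrate from t = 0 to t = 1 in ten steps and compare with exp(-1)
state, time, dt = [1.0], 0.0, 0.1
for _ in range(10):
    state = rk4_step(decay, time, state, dt)
    time += dt
error = abs(state[0] - math.exp(-1.0))
```

In a DNS the state vector holds the conservative variables at all grid points and the
right-hand side contains the spatially discretised Navier-Stokes operator.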

Various boundary conditions are implemented and can be prescribed according 
to the investigated flow. In spanwise direction periodicity is assumed with a domain 
width of λ_z = 3.2 (for L* see Table 1). For steady supersonic (base) flow all flow
quantities are fixed at the inflow plane. An adiabatic no-slip condition is prescribed 
at the wall. 

In case of an unsteady DNS the velocity and temperature disturbances are fixed
to zero at the wall; the underlying base flow is computed applying the adiabatic wall
condition. Defined disturbances can be excited via (synthetic) blowing and suction
through slits or holes at the wall. For further details see [11].


2.2.2 Bi-global Linear Stability Theory 

For the solution of the linear stability equations, base flow as well as perturbations 
are discretized on a structured curvilinear grid in the crossplane. The flow-quantity 
derivatives are modeled using compact FDs of up to tenth order in spanwise 
direction. A spectral Chebyshev collocation method is applied in wall-normal 
direction. Clustering grid points in regions of high shear away from the wall is 
possible using a grid transformation. 

The temporal approach of the stability theory poses a linear eigenvalue problem
(EVP). This is solved applying the implicitly restarted Arnoldi method [13]. The
solution according to the spatial approach is obtained either by an iteration of
the generalized approach, with α and ω both complex, until the temporal growth
rate ω_i converges to zero, or by Gaster’s relation described above. Given the excellent
agreement of these methods for the cases investigated so far, the minor gain in accuracy
does not justify the additional computational cost of the former method.
The algorithm for eigenvalue tracking is based on a best-match approach for the
eigenvectors at two consecutive tracking steps. The tracking process may become
difficult due to eigenmodes crossing in the complex eigenvalue plane.
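
A best-match tracking step of this kind can be sketched as follows: among the eigenpairs
of the current step, the one whose eigenvector has the largest normalised overlap with
the eigenvector of the previous step is chosen. The overlap measure and all names are
assumptions for illustration; the matching criterion of the actual code is not specified
in the text.

```python
import math


def overlap(v, w):
    """Normalised magnitude of the complex inner product of two vectors (0 to 1)."""
    inner = sum(a.conjugate() * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(abs(a) ** 2 for a in v))
    norm_w = math.sqrt(sum(abs(b) ** 2 for b in w))
    return abs(inner) / (norm_v * norm_w)


def track_mode(previous_vector, candidates):
    """Return the (eigenvalue, eigenvector) pair best matching the previous vector."""
    return max(candidates, key=lambda pair: overlap(previous_vector, pair[1]))


# Hypothetical data: eigenvector of the tracked mode at the previous step ...
previous = [1.0 + 0.0j, 0.0 + 1.0j, 0.0 + 0.0j]
# ... and two eigenpairs found at the current step
candidates = [
    (0.10 + 0.02j, [0.0 + 0.0j, 0.0 + 0.0j, 1.0 + 0.0j]),
    (0.30 - 0.01j, [2.0 + 0.0j, 0.0 + 2.0j, 0.1 + 0.0j]),
]
eigenvalue, eigenvector = track_mode(previous, candidates)
```

The overlap is insensitive to the arbitrary complex scaling of eigenvectors. When two
modes cross in the eigenvalue plane their overlaps can become similar, which is one way
the difficulty mentioned above shows up.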

The perturbations are assumed to be periodic in spanwise direction. At the wall 
vanishing velocity and temperature perturbations are prescribed. In the freestream 
all perturbations are assumed to decay exponentially. 


2.3 Flow Setup 

2.3.1 Roughness Geometry 

The investigated roughness geometry follows the setup of [10], see Fig. 2. The
height of the roughness k equals about 50 % of δ_T of the unperturbed flow at the
streamwise position of the element. The fence-like element (e = 8k wide and b = 2k
thick) is placed obliquely, skewed by ψ = 45° with respect to the oncoming flow.
The spanwise spacing between the roughness elements and, thus, the periodicity
length is chosen to be four times the spanwise width of the element, s_z = 4e.
The edges of the element are rounded due to numerical restrictions. The (minor)
differences between sharp- and soft-edge elements have been discussed in [10].

Fig. 2 Sketch of roughness setup. Left: top view with element dimensions b and e, spanwise
spacing s_z, and rotation angle ψ. Right: front view looking downstream (Source: [10])


2.3.2 Flow Conditions

The flow parameters for the cold-flow (wind-tunnel condition) case and, for
comparison, the hot-flow parameters from [10] are listed in Table 1. Both cases
exhibit a Mach number of about 4.8. The roughness element has been placed in both
cases at about Re_δT = 20,000, based on the undisturbed flat-plate flow, to have this
physically important parameter identical at first; see the further discussion below.
This results in R_{x_r} = 1,225 and Re_kk = 434 for the cold-flow case, and R_{x_r} = 3,226
and Re_kk = 6,000 for the hot-flow case. The values of Re_kk are based on flow
quantities of the undisturbed flat-plate flow at the roughness position and height:


$$ Re_{kk} = \frac{\rho(x_r, k)\, u(x_r, k)\, k}{\mu(x_r, k)} \qquad (3) $$


$Re_{kk}$ is much higher, at identical $Re_{\delta_T}$, for the hot flow because, among
other things, $\rho(x_r, k)/\rho_e(x_r)$ and $u(x_r, k)/u_e(x_r)$ are 3.2 and 1.4 times
larger, and $\mu(x_r, k)/\mu_e(x_r)$ is 3.1 times lower compared to the cold flow. Thus
stronger instability has to be expected if this parameter plays a crucial role.
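
As a plausibility check, the three ratios quoted above can be combined into the expected factor between the hot-flow and cold-flow values of $Re_{kk}$. The sketch below assumes that the edge-based Reynolds number is the same in both flows, so that $Re_{kk}$ scales with the product of the density and velocity ratios divided by the viscosity ratio; the function and variable names are illustrative only.

```python
# Hot-flow to cold-flow factors quoted in the text.
density_gain = 3.2    # rho(x_r,k)/rho_e(x_r) is 3.2 times larger in the hot flow
velocity_gain = 1.4   # u(x_r,k)/u_e(x_r) is 1.4 times larger
viscosity_drop = 3.1  # mu(x_r,k)/mu_e(x_r) is 3.1 times lower


def rekk_gain(rho_gain, u_gain, mu_drop):
    """Factor by which Re_kk rises when the edge-based Reynolds number is fixed."""
    return rho_gain * u_gain * mu_drop


predicted_gain = rekk_gain(density_gain, velocity_gain, viscosity_drop)
quoted_gain = 6000.0 / 434.0  # ratio of the two Re_kk values given in the text

print(f"predicted: {predicted_gain:.1f}, quoted: {quoted_gain:.1f}")
```

Both numbers come out close to 14, so the quoted ratios account for essentially the whole difference between the two $Re_{kk}$ values.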

The Reynolds number 


$$Re_k = \frac{\rho_e(x_r)\, u_e(x_r)\, k}{\mu_e(x_r)} = 10^4 = Re_{\delta_T}\, \frac{k}{\delta_T} \tag{4}$$


holds for both flows. Note that according to experimental studies [12] the scaling
of 3-d-roughness-induced transition follows a $Re_k$ = constant criterion, as for
2-d roughness in incompressible flow. In this respect the cases compared here might
be thought of as being equally unstable. However, the hot-flow case implies a strongly
cooled wall, $T_{w,\mathrm{rad}}/T_{w,\mathrm{ad}} < 0.1$ at Ma = 4.8, translating into a
strongly (second-mode) unstable flow. In fact it turns out that, for the $R_x$ value at
the roughness considered, the real, uncontrolled flow would have already transitioned
to turbulence due to second-mode instability. If, however, an ultrasonically absorptive
coating were used that weakens or suppresses second-mode instability, the roughness
would lead to transition anyway because it induces vorticity (first) modes. Thus the
laminar hot-flow case may not be realistic in any case but still serves as a meaningful
case for comparison.

Fig. 3 Top view of vortex structures. Visualization by means of the Q-criterion (Q = 0.07).
HV, MV, and IV denote the horseshoe, main, and inner vortex at leading (L) and trailing (T)
edge of the roughness element. Recirculation zones are shown by black isosurfaces of u < 0.
Gray bars indicate position and extent of the roughness element. Top: color shading indicates
streamwise vorticity $\omega_x$; red: clockwise rotation sense as seen in downstream direction; green:
counter-clockwise rotation. Bottom: color shading indicates height y/k above the wall


3 Base Flow 

The steady base flow for the B-LST analysis is extracted from the (D)NS base
flow. In spite of the unsteady formulation of the Navier-Stokes equations the flow
converges to a steady state. Convective exponential growth of numerical background
noise (round-off error, $10^{-13}$) can be seen downstream of the roughness in a temporal
Fourier analysis, but the amplitudes of the analyzed frequencies do not exceed a
value of $10^{-9}$ at the end of the integration domain, and can therefore be neglected.
For an identical setup, another base-flow solution had been computed with a different
code, see [10]. This code, however, has not been designed for DNS and thus has a
lower order of accuracy. Differences in the resulting B-LST growth rates are discussed
in Sect. 4.2.

A symmetric roughness configuration excites three pairs of equally strong 
counter-rotating vortices, with the main vortices being stronger than the horseshoe 
vortices (see [9]). The naming convention for the oblique setup follows [10]. 
Here the vortices induced at the leading edge (L) of the roughness element are 
significantly stronger than the trailing-edge (T) vortices. Thus, the leading-edge 
main vortex (LMV) becomes the dominant flow structure in the wake of the 
roughness. This is revealed by a vortex visualization using the Q-criterion, see
Fig. 3.

Fig. 4 Streamwise velocity (u) contours. Solid lines are isolines of u, starting with u = 0.1
near the wall, ending with u = 0.95. Left: color shading indicates vortices and rotation sense
shown in Fig. 3 at Q = 0.01. Right: contours of temperature. First row: streamwise position
$(x - x_r)/k = 30$, second row: $(x - x_r)/k = 100$, third row: $(x - x_r)/k = 200$. Dashed lines
show a projection of the roughness geometry


The horseshoe vortex (LHV) is less pronounced than the LMV. Whereas the pair
of inner vortices (IV) vanishes shortly behind a symmetric roughness [9], in the 
oblique setup the leading-edge inner vortex (LIV) is part of the formative vortex 
structure in the wake of the roughness element. The proximity of LMV and LIV leads
to interference between these co-rotating vortices, which in turn inhibits an early
crossflow-vortex-like turnover. The continuous lift-up of the vortices downstream
of the roughness element can be observed in Fig. 3 (bottom). The vortices’ positions 
above the wall can also be deduced from Fig. 4 (left). 

The influence of the vortices on skin friction and wall temperature is shown in the
flow's footprints in Fig. 5. The high-speed and the low-speed streaks can clearly be
identified in terms of the $c_f$ values. The dominant leading-edge high-speed streak
induces a pronounced streak of high wall temperature. The roughness element itself,
however, is only slightly heated compared to the flat-plate temperature in front of
the roughness and the main temperature rise downstream. The extent of the
high-temperature streak in wall-normal direction can also be observed in Fig. 4 (right).

Fig. 5 Top: contours of skin friction coefficient $c_f$ normalized by smooth-wall value. Bottom:
wall temperature $T_w$. Black solid lines indicate position and extent of the roughness element

The flow around the roughness element can be visualized by means of streamlines,
see Fig. 6. Near-wall streamlines in front of the roughness (Fig. 6a and b)
are driven around the roughness element in spanwise direction. Coming around
the leading edge, a small fraction is lifted up by the recirculation zone in the
element's wake, gets accelerated by fluid flowing near the boundary-layer edge,
and farther downstream joins the LIV. Figure 6c, d show a similar scenario of
the development of LIV and LMV, though the streamlines no longer bend around
the roughness element but flow across it. The LMV mainly gathers streamlines that
come from the height $y/k \approx 1$ in front of the roughness, whereas the LIV
streamlines originate from lower heights. Figure 6e, f show the deformation of
streamline layers located above the element height upstream of the roughness.
The streamlines are deflected to the shape of the velocity streaks, roughly following
the upper u-isolines of the crossplane at $(x - x_r)/k = 90$.


4 Stability Analysis 

With the B-LST several unstable eigenmodes have been identified at
$(x - x_r)/k = 30$, and afterwards been tracked in the streamwise direction
and with varying frequency to span a stability diagram for each eigenmode. To avoid
missing an eigenmode that is not yet amplified in the near wake of the roughness
but becomes unstable farther downstream, another search has been conducted at
$(x - x_r)/k = 250$.

Fig. 6 Streamlines of the cold flow initiated in crossplane $(x - x_r)/k = -10$ at heights (a)
y/k = 0.1, (b) y/k = 0.5, (c) y/k = 0.75, (d) y/k = 1.0, (e) y/k = 1.5 and (f) y/k = 2.0.
Flow direction is from upper right to lower left corner. Color shading indicates the absolute value
of the velocity vector $|\mathbf{v}| = \sqrt{u^2 + v^2 + w^2}$. Black isosurfaces show regions of separated flow
(u < 0) enclosing the roughness element. The crossplane in the lower left corner displays u-isolines
at $(x - x_r)/k = 90$

Fig. 7 Normalized modulus of the perturbation amplitude for streamwise velocity $u' = |u(y, z)|$
(first row), temperature $T'$ (second row) and pressure $p'$ (third row), plus phase relation of
pressure amplitude $\Phi_p = \arg[p(y, z)]$ (fourth row) for the two most unstable eigenmodes at
$(x - x_r)/k = 90$. Left: mode C1 at $\omega_r \delta_u \approx 0.80$. Right: mode C2 at $\omega_r \delta_u \approx 0.78$. Solid lines are
base-flow u-isolines for first and third row, T-isolines for the second row, and pressure perturbation
amplitude for the fourth row


The spatial growth rates

$$\alpha_i = -\frac{d}{dx} \ln \frac{A(x)}{A_0} \tag{5}$$

are gained from the computed temporal ones

$$\omega_i = \frac{d}{dt} \ln \frac{A(t)}{A_0} \tag{6}$$

by application of Gaster's relation, see Eq. (2). Based on the stability diagram
the integral growth, quantified by the N-factor development, is computed for
fixed frequencies from spatial growth rates extracted as a function of the streamwise
coordinate. The obtained data for the cold flow are compared to the hot-flow stability
data from [11].
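
The conversion from temporal to spatial growth rates and the subsequent N-factor integration can be sketched as follows. Eq. (2) is not reproduced in this section, so the sketch assumes the common form of Gaster's relation, $\alpha_i = -\omega_i / c_g$ with the group velocity $c_g = \partial\omega_r / \partial\alpha_r$; all numerical values and function names are hypothetical and serve only as an illustration.

```python
def gaster_spatial_growth(omega_i, group_velocity):
    """Spatial growth rate alpha_i from a temporal growth rate omega_i.

    Assumed form of Gaster's relation: alpha_i = -omega_i / c_g,
    with c_g = d(omega_r)/d(alpha_r). Growth corresponds to alpha_i < 0.
    """
    return -omega_i / group_velocity


def n_factor(x, alpha_i):
    """Trapezoidal approximation of N = -integral(alpha_i dx) along x."""
    total = 0.0
    for k in range(1, len(x)):
        total += -0.5 * (alpha_i[k] + alpha_i[k - 1]) * (x[k] - x[k - 1])
    return total


# Hypothetical values for illustration only (not taken from the paper).
stations = [0.0, 25.0, 50.0, 75.0, 100.0]   # streamwise positions
omega_i = [0.02] * len(stations)            # temporal growth rates
c_g = [0.8] * len(stations)                 # group velocities

alpha_i = [gaster_spatial_growth(w, c) for w, c in zip(omega_i, c_g)]
N_end = n_factor(stations, alpha_i)
print(f"alpha_i = {alpha_i[0]:.3f}, N at last station = {N_end:.2f}")
```

With a constant growth rate the trapezoidal rule is exact, which makes this toy case easy to check by hand.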


4.1 Unstable Eigenmodes 

The eigenfunctions of the two most unstable eigenmodes found within the investigated
streamwise range of $30 < (x - x_r)/k < 250$ are shown in Fig. 7. Amplitude
distributions of streamwise velocity, temperature and pressure are plotted.
Additionally, the phase relation of the pressure amplitude is displayed. The maxima of
the perturbation amplitudes coincide with regions of large gradients within the base
flow, which is shown in the plots for comparison. The shape of the temperature
eigenfunctions is similar to the corresponding u-velocity amplitude distribution,
though the temperature perturbations reach considerably larger amplitudes. The
pressure eigenfunctions feature non-negligible amplitudes at the wall.

The modes C1 and C2 exhibit a similarity to the even and odd eigenmodes found
in the wake flow of symmetric roughness elements, see [7] and [9]. For the even
mode the phase of the local eigenfunction maxima is symmetrical with respect to
the symmetry line; the odd mode exhibits an anti-symmetric phase relation. The
amplitude distributions of modes C1 and C2 are strongly distorted compared to
the modes of the symmetrical flow, but the corresponding phase relations of the
eigenfunctions' maxima reveal their even and odd nature, see Fig. 7 (fourth row) for
the phase relation of the pressure perturbation amplitude.


4.2 Local and Integral Growth of Unstable Eigenmodes 

Tracking the unstable eigenmodes along varying frequency at a fixed streamwise
location yields the results shown in Fig. 8 (left). In the near wake at
$(x - x_r)/k = 30$ the mode C1 is dominant. For comparison the hot-flow mode
results from [11] are also shown: the hot-flow LIV mode has a similar band of
amplified frequencies but an amplification rate that is roughly three times higher.
The hot-flow LMV mode shows a considerably broader amplified-frequency band.
Its maximum growth rate is only slightly lower than the LIV mode's and is located
at a four times higher frequency. Note that the hot-flow modes' eigenfunctions
differ significantly from the cold-flow modes due to major differences between
the base flows. For details see [10] and [11]. Going downstream, the amplified
frequency bands of the modes widen and the corresponding maxima shift to higher
frequencies (not shown). The maximum growth rate of mode C1 remains at its
near-wake value, whereas in the hot flow the amplification rate of the LMV
mode increases considerably.

N-factors, $N = -\int \alpha_i \, dx$ (Fig. 8, right), are gained from a streamwise tracking
of the eigenmodes. For clarity, and with the exception of mode C1, only the integrally
most amplified frequency of each mode is shown. The hot-flow results are again
taken from [11]. Additionally, eigenmodes of the smooth flat-plate flow are shown
for comparison. The cold-flow smooth-plate N-factors have been integrated starting
shortly behind the edge of the flat plate at about $(x - x_r)/k = -125$, whereas the
curves of the roughness eigenmodes inherently start downstream of the roughness
where they first have been detected. For the hot-flow condition the corresponding
smooth-plate N-factor has been integrated starting downstream of the roughness.

Fig. 8 Left: spatial growth rates $\alpha_i \delta_u$ as a function of frequency $\omega_r \delta_u$ (normalized by the
undisturbed boundary-layer thickness $\delta_u$ at $x = x_r$) at streamwise position $(x - x_r)/k \approx 30$.
Cold C1 mode (blue solid line with square symbols), cold C2 mode (blue dashed line, squares).
Hot LMV mode (red solid line, circles), hot LIV mode (red dashed line, circles) from [11]. Right:
N-factors over $(x - x_r)/k$. Cold C1 mode at $\omega_r \delta_u = 1.82$ (blue solid line, diamonds); cold C1
mode at $\omega_r \delta_u = 0.91$ (blue solid line, squares); cold C2 mode at $\omega_r \delta_u = 0.57$ (blue dashed
line, squares). For comparison: cold smooth-plate second mode (propagation angle 0°) at
$\omega_r \delta_u = 1.8$ (blue dash-dot line, squares); cold smooth-plate first mode (propagation angle ≈ 65°) at $\omega_r \delta_u = 0.57$
(blue dash-dot line, filled squares). Hot LMV mode at $\omega_r \delta_u = 3.3$ (red solid line, circles); hot LIV
mode at $\omega_r \delta_u = 0.82$ (red dashed line, circles); hot smooth-plate second mode at $\omega_r \delta_u = 2.1$ (red
dash-dot line, circles), again from [11]. Thin black dash-dot-dot lines indicate N = 5 and N = 9


In the cold flow the eigenmode C1 gains the highest integral amplification to
the end of the investigated range: N = 6.5 for $\omega_r \delta_u = 0.91$. Employing a
base-flow solution based on a solver with a lower order of accuracy, only N = 5 is
reached (see [10]), although the visual differences in the base flow are small. However,
$\omega_r \delta_u = 1.82$ exhibits higher local growth for $(x - x_r)/k > 115$, matching the
growth rate of mode C2 at the domain's end; compare the slopes of the corresponding
N-factor curves. The still increasing local growth of C2 at that location
implies that this mode may take over in integral growth farther downstream as well.
Nonetheless mode C1 at $\omega_r \delta_u = 0.91$ reaches an N-factor of about 5, which has
been found sufficient for transition to turbulence in wind-tunnel experiments under
noisy conditions, at about 180 roughness heights downstream of the roughness center.
The frequency $\omega_r \delta_u = 1.82$ follows at $(x - x_r)/k \approx 230$. The integrally most
amplified cold-flow smooth-plate eigenmodes gain N-factors of about 3.6 and 2.9
for the first and second mode, respectively, at the end of the investigated domain at
$(x - x_r)/k = 250$.

The hot-flow modes are much more strongly amplified. Within $50 < (x - x_r)/k < 150$
the hot-flow LMV mode grows more than twice as strongly as the most amplified
smooth-plate eigenmode, which is a second mode. The strong wall cooling
($T_{w,\mathrm{rad}}/T_{w,\mathrm{ad}} \approx 0.1$) does not seem to significantly weaken the shear-layer modes
found.


4.3 Superposition of Finite-Amplitude Eigenmodes
and Base Flow

The shown eigenmode C1 has been assigned a finite amplitude and superposed on
the respective base flow to get an impression of the three-dimensional disturbed
flow. Note that the spanwise mean and the shape of the disturbances are (unnaturally)
preserved, since the nonlinear interaction of the finite disturbance and the base flow
cannot be taken into account within the linearized approach.

The visualization is based on one single crossplane of the base flow and
the parameters of this crossplane's eigenmode. The complex amplitude of the
perturbation's modal ansatz (see Eq. (1)) defines the oscillation's physical amplitude
as well as the relative phase relation of the eigenfunction in the crossplane. The
streamwise wavenumber $\alpha_r$ and frequency $\omega_r$ determine the spatial wavelength and
temporal development, respectively.

The base-flow crossplane at the respective streamwise location has been extruded
in x-direction to four times the wavelength of the shown mode. At the beginning
of the interval shown the eigenmode has been added to the base flow assuming
an initial amplitude of 5 % for the streamwise velocity disturbance. In streamwise
direction the perturbation is allowed to grow with the spatial amplification rate $\alpha_i$
of the mode at the reference crossplane. The results are shown in Fig. 9.

The even disturbance mode C1 eventually leads to almost symmetrical
horseshoe-type secondary vortices traveling on top of the inducing base-flow
vortices (Fig. 9, left). The pressure footprint of the perturbed flow (Fig. 9, right)
is broader than the respective generating horseshoe vortex. The u-isolines in the
crossplanes shown also indicate the modulation of the flow by the developing
finite-amplitude disturbances.
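
The superposition described in this section can be illustrated for a single point of the crossplane. Eq. (1) is not reproduced here, so the sketch assumes a modal ansatz of the form $q' = \hat{q}(y, z)\, \exp[i(\alpha x - \omega_r t)]$ with $\alpha = \alpha_r + i\alpha_i$, for which $\alpha_i < 0$ means spatial growth. The function name and all numerical values are hypothetical.

```python
import cmath
import math


def perturbed_value(q_base, q_hat, amp0, alpha_r, alpha_i, omega_r, x, t):
    """Base-flow value plus one linear eigenmode at station x and time t.

    q_hat is the complex eigenfunction value at the (y, z) point considered,
    amp0 the initial amplitude assigned to the mode at x = 0.
    """
    alpha = complex(alpha_r, alpha_i)
    wave = cmath.exp(1j * (alpha * x - omega_r * t))
    return q_base + (amp0 * q_hat * wave).real


# Hypothetical numbers for illustration only.
alpha_r, alpha_i, omega_r = 2.0, -0.05, 1.5
wavelength = 2.0 * math.pi / alpha_r
x_end = 4.0 * wavelength          # extrusion length: four wavelengths

u_start = perturbed_value(0.6, 1.0 + 0.0j, 0.05, alpha_r, alpha_i, omega_r, 0.0, 0.0)
u_end = perturbed_value(0.6, 1.0 + 0.0j, 0.05, alpha_r, alpha_i, omega_r, x_end, 0.0)
growth = math.exp(-alpha_i * x_end)   # amplitude gain over the extruded interval

print(f"u at start: {u_start:.3f}, u after four wavelengths: {u_end:.3f}")
print(f"amplitude gain: {growth:.3f}")
```

Because the base-flow value and the eigenfunction are frozen at the reference crossplane, only the exponential factor changes along x, which mirrors the limitation of the linearized approach noted above.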


5 Direct Numerical Simulation with Controlled 
Unsteady-Disturbance Input 

For the cold-flow scenario an unsteady DNS is conducted. The disturbances are
introduced via blowing and suction through a hole at the wall by prescribing $(\rho v)'$
with a smooth function in time and space. For details see [11]. The hole has a
diameter of 3.6 k and is located upstream of the roughness at $(x - x_r)/k = -117$
at the element's streamwise center line (z = 0). The frequencies $\omega_r \delta_u = 0.91$ and
$\omega_r \delta_u = 1.82$ are excited continuously at a level of $(\rho v)'_{\max} = 10^{-7}$ each (6h = 0),
corresponding to a bi-harmonic point source. The DNS results are analyzed
applying a temporal Fourier analysis. The amplitude distributions for the two excited
frequencies are shown in Fig. 10 for one streamwise location. For comparison the
corresponding B-LST eigenfunctions are also plotted. DNS and B-LST amplitude
distributions agree well for $\omega_r \delta_u = 0.91$. The number of maxima is reproduced
correctly by the LST results. However, their location differs slightly. For $\omega_r \delta_u = 1.82$
the agreement is good as well.

Fig. 9 Timesteps of animated eigenmode C1 at $(x - x_r)/k = 90$. Left: $\lambda_2$-criterion
($\lambda_2 = -0.19$), color shading indicates the height above the wall. Right: pressure at the wall. First
row: t/T = 0, second row: t/T = 0.2, third row: t/T = 0.4 and fourth row: t/T = 0.6.
Flow direction is from lower left to upper right corner. The gray shading indicates the unperturbed
vortices of the steady base flow. Crossplanes display u-isolines

Fig. 10 Comparison of normalized disturbance amplitude distribution from DNS (contours) and
LST (thick black solid lines) at $(x - x_r)/k = 150$ for frequencies $\omega_r \delta_u = 0.91$ (left) and 1.82
(right). Thin solid lines are corresponding base-flow isolines, dashed for negative values. First row:
streamwise-velocity disturbance $u'$. Second row: wall-normal-velocity disturbance $v'$. Third row:
spanwise-velocity disturbance $w'$. Fourth row: temperature disturbance $T'$


The $u'$-amplitude evolution with roughness is displayed in Fig. 11, compared
to the smooth-plate case. Up to a short distance in front of the recirculation zone
upstream of the roughness, $(x - x_r)/k \approx -15$, the disturbance evolution is
identical. This zone acts like an amplifier for the disturbances (see the inset of
Fig. 11). Marxen et al. [15] have found an analogous behavior for a two-dimensional
roughness element under identical flow conditions. Across the roughness itself the
disturbances are damped. Whereas the flow behind the two-dimensional element
soon returns to the smooth-plate flow, here the flow is persistently deformed by
the three-dimensional roughness element. The vortices and subsequent velocity
streaks enhance the disturbance amplification significantly. The ($\omega_r \delta_u = 0.91$)-disturbance
grows by a factor of 6,240, corresponding to an N-factor of 8.7,
within $-15 < (x - x_r)/k < 385$. The corresponding smooth-plate disturbance
reaches an amplitude ratio of only 3.4 (N = 1.2) within the same streamwise range.
The ($\omega_r \delta_u = 1.82$)-disturbance even reaches an N-factor of 10. The equivalent
smooth-plate disturbance experiences a local amplification only. After decaying
along the plate up to $(x - x_r)/k = 120$ it finally reaches its maximum amplitude at
$(x - x_r)/k = 240$. At the end of the computational domain the disturbances are
still linear: the amplitude of the mean flow deformation reaches only $4 \cdot 10^{-5}$ at
$(x - x_r)/k = 385$.

Fig. 11 Comparison of u-velocity-disturbance amplitude growth as function of $(x - x_r)/k$ from
DNS (lines) and LST (symbols) for frequencies $\omega_r \delta_u = 0.91$ (blue long-dashed line and squares)
and 1.82 (blue solid line and diamonds). Blue dash-dot line shows amplitude of mean-flow
deformation ($\omega_r \delta_u = 0$), thin blue solid lines represent further nonlinear products of excited
frequencies. Green lines indicate disturbance evolution in the smooth-plate boundary layer. The
inset displays the region in the vicinity of the roughness for the two excited frequencies in detail,
the vertical dotted lines indicating the streamwise extent of the oblique element

To compare with B-LST the amplitudes have been matched at $(x - x_r)/k = 60$
for the frequency $\omega_r \delta_u = 0.91$, and at $(x - x_r)/k = 150$ for $\omega_r \delta_u = 1.82$. For the
frequency $\omega_r \delta_u = 0.91$ the agreement is excellent; for $\omega_r \delta_u = 1.82$ the growth rates
obtained by the B-LST analysis are too low.
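
The amplitude ratios and N-factors quoted for the DNS are linked by $N = \ln(A/A_0)$, which follows from integrating Eq. (5). The short sketch below checks the quoted pairs; the function names are chosen here for illustration.

```python
import math


def n_factor(amplitude_ratio):
    """N-factor corresponding to an amplitude ratio A/A_0."""
    return math.log(amplitude_ratio)


def amplitude_ratio(n):
    """Amplitude ratio A/A_0 corresponding to an N-factor."""
    return math.exp(n)


n_rough = n_factor(6240.0)          # roughness case at omega_r*delta_u = 0.91
n_smooth = n_factor(3.4)            # smooth-plate case, same streamwise range
ratio_n10 = amplitude_ratio(10.0)   # roughness case at omega_r*delta_u = 1.82

print(f"N(6240) = {n_rough:.2f}, N(3.4) = {n_smooth:.2f}")
print(f"A/A_0 for N = 10: {ratio_n10:.0f}")
```

The results reproduce the values N = 8.7 and N = 1.2 given in the text and show that N = 10 corresponds to an amplitude gain of roughly 22,000.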


6 Computational Aspects 

6.1 Bi-global Linear Stability Theory 

The B-LST analyses have been conducted on the NEC SX-9 machine at the High
Performance Computing Center Stuttgart (HLRS) of the University of Stuttgart
using a single CPU. A parallel code version of the solver BIGSTAB has not been
implemented satisfactorily so far. Solving a typical temporal eigenvalue problem
in a crossplane of $N_y \times N_z = 61 \times 110$ takes about 1,950 s for 50 eigenvalues in
the vicinity of a specified location in the complex plane. Time consumption of the
iterative solver implemented in the ARPACK library depends on the number of
computed eigenvalues, but mainly on the clustering of eigenvalues around the given
location of interest. The average vector length is 248 at a vector operation ratio
of about 99.6 %; 58.5 GFLOPS are achieved. The maximum memory needed is
roughly 60 GB. A further reduction of the memory requirements as well as the time
consumption may be achieved by taking advantage of the sparse-matrix structure of
the stability problem.

The eigenvalue tracking procedure compares the eigenvectors of subsequently 
solved eigenvalue problems. The additional cost of this comparison is negligible 
in relation to the EVP solution. Thus, the computational time for an eigenvalue 
tracking roughly equals the number of tracking steps multiplied by the time needed 
to solve one EVP. However, the tracking process requires slightly more memory for 
the eigenvector comparison routine. 
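
The eigenvector comparison used for tracking can be sketched as follows. The actual routine in the solver is not described in detail, so this is only an illustration under the assumption that the similarity measure is a normalized complex inner product, which is insensitive to the arbitrary phase and scaling of an eigenvector. All names and the three-component vectors are hypothetical.

```python
def overlap(v, w):
    """Normalized magnitude of the complex inner product of v and w (0 to 1)."""
    inner = sum(a.conjugate() * b for a, b in zip(v, w))
    norm_v = sum(abs(a) ** 2 for a in v) ** 0.5
    norm_w = sum(abs(b) ** 2 for b in w) ** 0.5
    return abs(inner) / (norm_v * norm_w)


def track_mode(previous_vector, candidates):
    """Index and score of the candidate most similar to previous_vector."""
    scores = [overlap(previous_vector, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical eigenvectors: one from the previous tracking step and two
# candidates returned by the eigenvalue solver at the current step.
previous = [1.0 + 0.0j, 0.5j, 0.0]
candidates = [
    [0.0, 1.0, 1.0j],             # unrelated mode
    [0.9 + 0.1j, 0.55j, 0.05],    # slightly changed version of `previous`
]

best_index, best_score = track_mode(previous, candidates)
print(best_index, round(best_score, 3))
```

A check of this kind costs only a few inner products per tracking step, which is consistent with the statement that the comparison is negligible next to the solution of the eigenvalue problem itself.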

Porting of the B-LST code to the new CRAY XE6 is pending, and inherently 
connected to its parallelization applying the PARPACK library as well as taking 
advantage of the sparse matrices. 


6.2 Direct Numerical Simulation 

6.2.1 General Performance Analysis for NS3D on CRAY XE6 

The direct numerical simulations have been conducted on the CRAY XE6 at HLRS 
applying our NS3D code. The computational grid is decomposed into an arbitrary 
number of domains along streamwise and wall-normal direction. Each domain 
represents one MPI process. Shared-memory parallelization is applied in spanwise 
direction. 

In the course of porting the NS3D code from the previously used
NEC SX-9 to the current CRAY XE6 system, explicit finite differences (EFDs)
have been implemented in order to speed up the computation of derivatives
for a large number of domains. This approach reduces the amount of necessary
MPI communication significantly. The gain in performance (for a 2-d flow case)
compared to the more accurate but costlier compact finite differences (CFDs)
is shown in Fig. 12 for up to 4,096 MPI processes with domain decomposition
along the streamwise direction. For the strong-scaling test using the CRAY XE6
installation step 0, 1.28 million grid points ($N_x \times N_y = 25{,}600 \times 50$) have been
distributed to 16, 32, 64, 128, 256 and 512 MPI processes with domain sizes from
$N_x \times N_y = 1{,}600 \times 50$ to $N_x \times N_y = 50 \times 50$ per MPI process. The superlinear
speed-up using EFDs is due to cache effects. The CFDs' speed-up suffers from
the huge amount of communication required for large numbers of MPI processes.

Fig. 12 Speed-up for strong scaling (left) and parallel efficiency for weak scaling (right) of
the NS3D code (2-d flow case) on CRAY XE6 at HLRS as a function of the number of MPI
processes. Thick solid line: eighth-order explicit finite differences (EFD-08). Dashed line:
sixth-order compact finite differences (CFD-06). Thin solid line (left) represents ideal speed-up

For the weak-scaling test using the CRAY XE6 installation step 1, the domain per
MPI process has been fixed to $N_x \times N_y = 50 \times 50$ and the number of processes
increased. The parallel efficiency of EFD-based computations persists at a high
level, above 95 %, whereas the efficiency of CFD-based computations drops to
about 6.6 % for 4,096 MPI processes, again suffering from the amount of
inter-domain communication.

S. Andersson from CRAY Inc. extended the weak-scaling test for the EFDs to a
larger number of cores and a 3-d symmetrical flow case. This time each MPI process
($N_x \times N_y \times N_z = 20 \times 100 \times 33$) hosted 8 OpenMP threads. Andersson showed
that for 14,000 MPI processes in streamwise direction (112,000 cores) NS3D yields
a parallel efficiency of 94 % with respect to the performance of 100 MPI processes
(800 cores).
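
The quantities used in these scaling tests can be written down compactly. The sketch below uses the usual definitions of strong-scaling speed-up and of strong- and weak-scaling parallel efficiency, which are assumed here since the text does not spell them out, and reproduces the domain sizes and core counts quoted above. The timings in the example are hypothetical.

```python
def strong_scaling_speedup(t_ref, t_p):
    """Speed-up relative to a reference run of the same total problem size."""
    return t_ref / t_p


def strong_scaling_efficiency(t_ref, p_ref, t_p, p):
    """Speed-up divided by the increase in process count."""
    return (t_ref * p_ref) / (t_p * p)


def weak_scaling_efficiency(t_ref, t_p):
    """Runtime ratio at fixed work per process; 1.0 means perfect scaling."""
    return t_ref / t_p


# Strong-scaling decomposition from the text: 25,600 x 50 points in total.
NX_TOTAL, NY = 25_600, 50
process_counts = [16, 32, 64, 128, 256, 512]
points_per_process = {p: (NX_TOTAL // p, NY) for p in process_counts}

# Core count of the extended weak-scaling test: 8 OpenMP threads per process.
cores_large_run = 14_000 * 8

# Hypothetical wall times (seconds) showing a superlinear strong-scaling case.
speedup = strong_scaling_speedup(100.0, 2.5)                  # 16 -> 512 processes
efficiency = strong_scaling_efficiency(100.0, 16, 2.5, 512)   # > 1 is superlinear

print(points_per_process[16], points_per_process[512], cores_large_run)
print(speedup, efficiency)
```

An efficiency above one, as in the hypothetical timings, corresponds to the superlinear behavior attributed to cache effects for the EFDs.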


6.2.2 Performance Data for Current Work Results 

The current-work computations have been done applying the compact finite differences
version of the NS3D code for the asymmetrical flow case. For a typical base
flow computation the computational grid of 1,540 × 150 × 96 points in streamwise
(x), wall-normal (y) and spanwise (z) direction has been decomposed into 154
domains of size $N_x \times N_y \times N_z = 70 \times 30 \times 96$. On each node one MPI process
has been started, representing one subdomain. Thirty-two OpenMP threads per
node accounted for spanwise parallelization, resulting in a total of 4,928 CPUs. This
configuration achieved 45.1 µs per grid point and timestep. For simulations with
unsteady disturbance input the size of the computational domain has been reduced
to no. of domains × $N_x \times N_y \times N_z$ = 105 × 70 × 30 × 96, which yields a total
of 3,360 CPUs. This reduction increased the code's performance to 35.9 µs per grid
point and timestep. If the configuration is chosen according to the one applied by
Stefan Andersson (four MPI processes reside on one node, each using 8 OpenMP
threads), the specific time per grid point and timestep can be reduced to 26.6 µs.

Similar asymmetrical cases (no. of domains × $N_x \times N_y \times N_z$ = 8 × 110 ×
150 × 64) executed on the NEC SX-9 achieved between 0.95 and 1.5 µs per grid
point and timestep in single-node mode (8 CPUs with 11.2 GFLOPS each) and
multi-node mode (2 nodes, each of the 8 MPI processes hosting 4 Microtasking
threads, 7.3 GFLOPS per CPU), respectively. A single-node symmetrical case
(no. of domains × $N_x \times N_y \times N_z$ = 8 × 110 × 300 × 33) achieved 0.66 µs per grid
point and timestep with 16.1 GFLOPS per CPU.
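
The CPU totals quoted for the CRAY XE6 runs follow from one MPI process per domain with 32 OpenMP threads each. A minimal check, with names chosen here for illustration:

```python
def total_cores(n_domains, threads_per_process):
    """Cores used when every domain is one MPI process with its own threads."""
    return n_domains * threads_per_process


base_flow_cores = total_cores(154, 32)   # base-flow computation
unsteady_cores = total_cores(105, 32)    # runs with unsteady disturbance input

# Relative reduction of the specific time per grid point and timestep
# when going from 45.1 to 35.9 microseconds.
time_reduction = 1.0 - 35.9 / 45.1

print(base_flow_cores, unsteady_cores, f"{100 * time_reduction:.0f} %")
```

The smaller configuration thus lowers the specific time by about one fifth.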


7 Conclusions 

A laminar hypersonic flat-plate boundary-layer flow altered by a discrete oblique
surface roughness element with a height of half the thermal boundary-layer
thickness has been analyzed regarding its stability properties. Cold wind-tunnel
conditions have been considered and compared to recent hot-flow results from
[10] and [11]. In the cold flow with an adiabatic wall the roughness' leading-edge
main vortex is the dominant flow structure in the wake of the roughness element.
However, an early crossflow-vortex-like turnover is inhibited by the adjacent inner
vortex. The near-wake of the oblique roughness element exhibits stronger gradients
than for a symmetric roughness configuration, cf. [7, 8]. In the hot flow with a
strongly cooled wall ($T_{w,\mathrm{rad}}/T_{w,\mathrm{ad}} < 0.1$), these two vortices gain a larger spanwise
distance, resulting in a separate development. Due to the cool wall the hot-flow case
exhibits larger gradients near the wall.

The instability modes found in the cold-flow wake show a lower amplification
rate compared to the unstable hot-flow modes from [11]. This is in accordance with
the cold flow's much lower $Re_{kk}$ value of 434, being roughly 1/14th of the hot-flow
value ($Re_{kk}$ = 6,000) at equal $Re_k = 10^4$. Hence $Re_k$ is not a similarity parameter.
The integrally most amplified cold-flow mode gains N = 6.5 at about 250 roughness
heights downstream of the roughness center, compared to the N-factor of 9 at about
110 roughness heights for the most unstable hot-flow mode. The high-order accurate
base-flow solver leads to larger N-factors: at the farthest downstream position
considered, N = 6.5 instead of 5 with a lower-order solver. Furthermore, the band of
unstable frequencies is narrower for the cold-flow mode, its growth rate maximum
shifted to lower non-dimensional values. We note that the hot-flow case has been set
up for comparison at identical values of $Re_{\delta_T}$ and $k/\delta_T$, but is fundamentally more
unstable due to the strong wall cooling. The cold wall does not seem to stabilize the
roughness-induced instability though its character is of vorticity (first-mode) type
rather than acoustic (second-mode) type.

The separated boundary layer in front of the roughness and the coinciding
local adverse pressure gradient act as a local disturbance amplifier. Along the
roughness, disturbances are strongly damped due to flow acceleration. Both facts are
in accordance with [15]. In contrast to the flow behind the two-dimensional element
of [15], the stability properties of the flow behind the three-dimensional roughness
do not return to the smooth-plate values. The induced longitudinal vortices and the
subsequently generated persistent velocity streaks imprint three-dimensionality on
the flow.


Effect of Wall Roughness Seen by Particles
in Turbulent Channel and Pipe Flows


Michael Breuer and Michael Alletto 


Abstract In the present contribution it is shown in the context of large-eddy 
simulation (LES) and a Lagrangian treatment of the disperse phase that it is possible 
to considerably improve the particle statistics in turbulent channel and pipe flows 
by adopting a recently published wall roughness model for the solid phase. First, 
the model presented by Breuer et al. (Int J Multiphase Flow 43:157-175, 2012) is 
evaluated by means of the experiments conducted for a turbulent channel flow by 
Kussin (Experimentelle Studien zur Partikelbewegung und Turbulenzmodifikation 
in einem horizontalen Kanal bei unterschiedlichen Wandrauhigkeiten. Ph.D. thesis, 
Martin-Luther-Universität Halle-Wittenberg, Germany, 2004) and Kussin and
Sommerfeld (Exp Fluids 33:143-159, 2002). As a second test case the experiments 
of Boree and Caraman (Phys Fluids 17:055108-1-055108-9, 2005) carried out 
for a turbulent pipe flow were used. For both setups involving rough walls good 
agreement between experiment and simulation is achieved by considering the effect 
of the wall roughness on the particle motion. Especially the latter configuration 
is the precondition for an improved simulation of the complex particle-laden 
turbulent flow in a combustion chamber reported in the last issue (Breuer and
Alletto, High Performance Computing in Science and Engineering '11. Springer,
Berlin/Heidelberg, 2012) and in Alletto and Breuer (Int J Multiphase Flow 45:
70-90, 2012) and Breuer and Alletto (Int J Heat Fluid Flow 35:2-12, 2012).


1 Introduction

Owing to complex phenomena in technically relevant configurations involving
particle-laden turbulent flows, e.g., massive recirculation regions in combustion
chambers with bluff-body stabilized flames [5] or the variety of secondary
flows (e.g., precessing vortex cores, recirculation regions) in a cyclone separator
[27], it is indispensable (i) to have adequate numerical tools able to reliably predict
such complicated flows and (ii) to generate appropriate inflow conditions which
closely resemble the real conditions at the domain boundary. Condition (i) can be
achieved by means of LES as already shown by a variety of investigations [6, 7].
The aim of the research study carried out recently was to fulfill condition (ii) for
particle-laden inflow conditions at high mass loadings.

Former investigations of Breuer and Alletto [8-10] and Alletto and Breuer [1] 
have shown that generating the inflow conditions by means of a particle-laden 
turbulent pipe flow assuming specular reflection of the disperse phase at the walls 
led, especially for high mass loadings, to discrepancies with the measurements 
closely downstream of the entrance of the combustion chamber studied. The mean 
particle velocities were found to be higher and the particle velocity fluctuations 
lower than observed in the experiments of Borée et al. [5]. As shown by previous 
investigations [3, 22, 32, 34, 38], modeling the effect of the wall roughness can correct 
the aforementioned discrepancies between simulation and experiment. Furthermore, 
the experiments of Borée et al. [5] showed a considerably flatter mean fluid velocity 
profile for the high mass loading $\eta = 110\,\%$ than for the unladen flow. Vreman [38] 
reported in his DNS of a four-way coupled turbulent pipe flow that this effect 
can be reproduced by the interaction between particles hitting a rough wall and 
the continuous phase. Hence, the aim of our recent research was to develop a 
model which mimics the rebound behavior of solid particles at rough walls. In 
contrast to previous models found in the literature [18, 33, 37, 38], the sandgrain 
roughness model recently published in Breuer et al. [11] has the major advantage 
of establishing an explicit relation between the carpet of densely packed spheres 
modeling the rough asperities and commonly used roughness parameters such as 
the mean roughness height $R_z$ or the root-mean-squared roughness $R_q$. Kussin and 
Sommerfeld [21, 22, 35] reported the only investigations known to the authors which 
explicitly investigated experimentally the roughness effects on turbulent particle-laden 
flows with a clear specification of the mean roughness $R_z$ of the bounding 
walls. In order to validate the new model a variety of tests were performed in a 
turbulent channel flow (see [11]). The results showed good agreement between the 
simulated particle statistics and the experiments [21, 22, 35]. 

In the following we briefly summarize the results obtained for the channel 
flow. Furthermore, the results achieved by applying the new model to a turbulent 
particle-laden pipe flow are shown and compared with the experiments of Borée 
and Caraman [4]. This study is a preliminary investigation aimed at generating 
better inflow conditions for the time-consuming simulations in a model combustion 
chamber already described in the previous report by Breuer and Alletto [10] and 
additionally in [1, 9]. 

2 Governing Equations 


In this work the multiphase flow is described using an Euler-Lagrange approach in 
which the two different phases are solved in two different frames of reference. 

The continuous phase is solved in an Eulerian frame of reference. The conservation 
equations of the filtered quantities used in LES [7] can be extended to take 
into account the feedback effect of the particles on the fluid (two-way coupling). 
For that purpose the particle-source-in-cell method described in Crowe et al. [15] is 
used. 

The dispersed phase is solved in a Lagrangian frame of reference. The equation 
of motion is given by Newton’s second law, where the fluid forces are derived from 
the displacement of a small rigid sphere in a non-uniform flow [25]. For particles 
with a density much higher than that of the carrier fluid, i.e., $\rho_p/\rho_f \gg 1$, only the drag, lift, 
gravity and buoyancy forces have to be considered, leading to: 
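
As a sketch only: assuming the usual point-particle form, in which the drag is expressed through the relaxation time $\tau_p$ and a drag correction factor $\alpha$, buoyancy is folded into the gravity term, and the two lift forces are written as $\mathbf{F}^{saff}$ and $\mathbf{F}^{mag}$, the translational and rotational equations referred to as (1) and (2) can be written as follows. The symbols $I_p$ (particle moment of inertia) and $\mathbf{T}$ (torque) are introduced here for illustration.

```latex
\frac{d\mathbf{u}_p}{dt}
  = \frac{\mathbf{u}_f - \mathbf{u}_p}{\tau_p/\alpha}
  + \left(1 - \frac{\rho_f}{\rho_p}\right)\mathbf{g}
  + \frac{\mathbf{F}^{saff} + \mathbf{F}^{mag}}{m_p}
  \qquad (1)

I_p \, \frac{d\boldsymbol{\omega}_p}{dt} = \mathbf{T}
  \qquad (2)
```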



$\mathbf{u}_p$, $\mathbf{u}_f$, $\tau_p$, $\mathbf{g}$, $m_p$ and $\boldsymbol{\omega}_p$ are the particle velocity, the fluid velocity at the particle 
position, the particle relaxation time, the gravitational acceleration, the mass of the 
particle and the particle angular velocity. The lift force on a particle due to the 
velocity shear (Saffman force) is calculated as follows [26]: 

$\mathbf{G}$ is the fluid velocity gradient tensor. The lift force on a particle due to rotation 
(Magnus force) $\mathbf{F}^{mag}$ is calculated as follows [16]: 

The relative rotation of the particles $\boldsymbol{\Omega}_{rel}$ is given by: 

$$\boldsymbol{\Omega}_{rel} = \tfrac{1}{2}\,\nabla \times \mathbf{u}_f - \boldsymbol{\omega}_p. \qquad (5)$$ 

The lift coefficient $C_{LR}$ is based on an experimentally determined correlation for 
$Re_p < 140$ [28]: 

The Reynolds number of particle rotation $Re_r$ and the particle Reynolds number 
$Re_p$ are defined as: 

$$Re_r = \frac{\rho_f\, d_p^2\, |\boldsymbol{\Omega}_{rel}|}{\mu_f}, \qquad Re_p = \frac{\rho_f\, |\mathbf{u}_f - \mathbf{u}_p|\, d_p}{\mu_f}. \qquad (7)$$ 



Here $d_p$ denotes the particle diameter. The drag force on the particle is based on 
Stokes flow around a sphere, where the corresponding drag coefficient is given by 
$C_D = 24\,\alpha/Re_p$ with the correction factor $\alpha = 1 + 0.15\,Re_p^{0.687}$ to extend the validity 
towards higher $Re_p$. 
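
The drag relations above can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' implementation; the function names and the fluid and particle values in the usage lines (an air-like fluid and a 60 µm particle with 1 m/s slip velocity) are assumptions chosen for illustration.

```python
import numpy as np


def particle_reynolds(rho_f, mu_f, d_p, u_f, u_p):
    """Particle Reynolds number Re_p = rho_f * |u_f - u_p| * d_p / mu_f."""
    slip = np.linalg.norm(np.asarray(u_f, dtype=float) - np.asarray(u_p, dtype=float))
    return rho_f * slip * d_p / mu_f


def drag_correction(re_p):
    """Correction factor alpha = 1 + 0.15 * Re_p**0.687."""
    return 1.0 + 0.15 * re_p**0.687


def drag_coefficient(re_p):
    """Drag coefficient C_D = 24 * alpha / Re_p (requires Re_p > 0)."""
    return 24.0 * drag_correction(re_p) / re_p


# Illustrative values only (assumed, not taken from the test cases).
rho_f, mu_f, d_p = 1.2, 1.8e-5, 60e-6
re_p = particle_reynolds(rho_f, mu_f, d_p, u_f=[1.0, 0.0, 0.0], u_p=[0.0, 0.0, 0.0])
print(re_p, drag_correction(re_p), drag_coefficient(re_p))
```

For vanishing slip velocity the correction factor tends to one and the Stokes drag law is recovered.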

For tiny particles with a relaxation time of the same order as the smallest fluid 
time scales the unresolved scales in LES become important for the particle motion. 
To consider the effect of the subgrid scales, a simple stochastic model by Pozorski 
and Apte [29] is applied. It requires the estimation of the subgrid-scale kinetic 
energy carried out with the help of the scale similarity approach of Bardina et al. [2]. 
To account for the rotation of the spherical particles around three Cartesian axes, 
Newton’s second law for the angular momentum (2) is considered. The torque acting 
on a rotating spherical particle is determined based on the formulation of Rubinow 
and Keller [30]. 


3 Numerical Methods 

The continuous phase is solved in an Eulerian frame of reference using the computer 
code LESOCC (= Large-Eddy Simulation On Curvilinear Coordinates [6, 7, 12]) 
to integrate the governing equations in space and time. It is based on a 3-D 
finite-volume method for arbitrary non-orthogonal and block-structured grids. The 
entire discretization is second-order accurate in space and time. For modeling the 
non-resolvable subgrid scales the well-known Smagorinsky model [31] with Van 
Driest damping near solid walls is applied. Alternatively, a dynamic model can be 
used. 

The ordinary differential equation (1) for the particle motion is integrated by 
a fourth-order Runge-Kutta scheme. To avoid time-consuming search algorithms, 
the second integration of Eq. (1) to determine the particle position is done in the 
computational space. Here an explicit relation between the position of the particle 
and the cell index containing the particle exists [12, 14], which is required to 
calculate the fluid forces on the particle. This yields a highly efficient particle tracking 
scheme which allows the paths of millions of particles to be predicted. 
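
A classical fourth-order Runge-Kutta step for the particle velocity can be sketched as follows. The sketch assumes a reduced right-hand side containing only Stokes-type drag and gravity, with the fluid velocity frozen during the step; lift forces, the drag correction factor and the integration of the position in computational space are omitted. All names are hypothetical.

```python
import numpy as np


def rhs(u_p, u_f, tau_p, g):
    """Reduced particle acceleration: drag relaxation towards u_f plus gravity."""
    return (u_f - u_p) / tau_p + g


def rk4_step(u_p, u_f, tau_p, g, dt):
    """One classical fourth-order Runge-Kutta step for du_p/dt = rhs(u_p)."""
    k1 = rhs(u_p, u_f, tau_p, g)
    k2 = rhs(u_p + 0.5 * dt * k1, u_f, tau_p, g)
    k3 = rhs(u_p + 0.5 * dt * k2, u_f, tau_p, g)
    k4 = rhs(u_p + dt * k3, u_f, tau_p, g)
    return u_p + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(u_p0, u_f, tau_p, g, dt, n_steps):
    """Advance the particle velocity over n_steps steps of size dt."""
    u_p = np.array(u_p0, dtype=float)
    for _ in range(n_steps):
        u_p = rk4_step(u_p, u_f, tau_p, g, dt)
    return u_p


# Illustrative values only: uniform fluid velocity, gravity in negative y-direction.
u_f = np.array([1.0, 0.0, 0.0])
g = np.array([0.0, -9.81, 0.0])
tau_p = 0.01
u_end = integrate([0.0, 0.0, 0.0], u_f, tau_p, g, dt=tau_p / 10.0, n_steps=50)
print(u_end)
```

For this reduced problem the exact solution is an exponential relaxation towards $\mathbf{u}_f + \tau_p \mathbf{g}$, which offers a simple check of the integrator.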

The fluid velocity $\mathbf{u}_f$ at the particle position is calculated with a Taylor series 
expansion around the cell center next to the particle [24]. This interpolation was 
shown to have a weaker filtering effect on the fluid velocity than a trilinear 
interpolation, leading to better results for particles with small relaxation times $\tau_p$. 
The set of three linear ordinary differential equations (2) for the particle angular 
velocity is solved analytically. 
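
The idea of the Taylor-series interpolation can be sketched as follows, assuming a first-order expansion that uses the velocity and the velocity gradient tensor stored at the nearest cell center. The order of the expansion actually used is not stated here, so the truncation and all names are assumptions.

```python
import numpy as np


def interpolate_fluid_velocity(u_c, grad_u, x_c, x_p):
    """First-order Taylor expansion u_f(x_p) ~ u_c + G (x_p - x_c).

    u_c    : fluid velocity at the cell center, shape (3,)
    grad_u : velocity gradient tensor G[i, j] = d u_i / d x_j, shape (3, 3)
    x_c    : cell-center position, shape (3,)
    x_p    : particle position, shape (3,)
    """
    return u_c + grad_u @ (x_p - x_c)


# Illustrative linear shear flow u = (2 y, 0, 0): the expansion is exact here.
grad_u = np.array([[0.0, 2.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
x_c = np.array([0.5, 0.5, 0.5])
u_c = grad_u @ x_c
x_p = np.array([0.6, 0.7, 0.4])
print(interpolate_fluid_velocity(u_c, grad_u, x_c, x_p))
```
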

The collisions between particles in a four-way coupled simulation are detected 
deterministically by a recently developed collision method described in detail 
in Breuer and Alletto [9]. Based on the technique of uncoupling, the calculation of 
particle trajectories is split into two stages. In the first stage particles are moved 
based on the equation of motion without inter-particle interactions. In the second 
stage the occurrence of collisions during the first stage is examined for all particles. 
If a collision is found, the velocities of the collision pair are replaced by the post-collision 
velocities without changing their position. The post-collision velocities are 
calculated by a hard-sphere collision model involving friction between the colliding 
spheres (see, e.g., [16, 36]), where changes of the velocity and angular velocity of the 
two involved particles are modeled by the normal coefficient of restitution $e_{n,p}$, the 
tangential restitution coefficient $e_{t,p}$, the static coefficient of friction $\mu_{st,p}$ and the 
dynamic coefficient of friction $\mu_{dy,p}$. 
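
The velocity update of the second stage can be illustrated for the simplest case of a frictionless hard-sphere collision, in which only the normal coefficient of restitution enters. This is a simplified sketch under that assumption; the model with friction and rotation is not reproduced, and the function name is hypothetical.

```python
import numpy as np


def collide_frictionless(x1, u1, m1, x2, u2, m2, e_n):
    """Post-collision velocities of two hard spheres without friction.

    Positions are left unchanged; only the velocity components along the
    line of centers are modified, using the normal restitution coefficient e_n.
    """
    n = (x2 - x1) / np.linalg.norm(x2 - x1)   # unit vector from sphere 1 to 2
    approach = np.dot(u1 - u2, n)             # > 0 if the spheres approach
    if approach <= 0.0:
        return u1.copy(), u2.copy()           # separating pair: no collision
    impulse = (1.0 + e_n) * approach * m1 * m2 / (m1 + m2)
    return u1 - impulse / m1 * n, u2 + impulse / m2 * n


# Illustrative head-on collision of two equal spheres.
x1, x2 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
u1, u2 = np.array([1.0, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])
v1, v2 = collide_frictionless(x1, u1, 1.0, x2, u2, 1.0, e_n=0.97)
print(v1, v2)
```

The impulse is chosen such that the total momentum is conserved and the relative normal velocity after the collision equals $-e_{n,p}$ times its value before the collision.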

If the particle passes the center of the cell adjacent to the wall, a recently 
developed model (Breuer et al. [11]) to mimic the rebound behavior of the particle 
at rough walls is applied. The model takes into account the momentum loss of the 
particle during the wall impact by the wall-normal restitution coefficient $e_{n,w}$, the 
tangential restitution coefficient $e_{t,w}$, the static coefficient of friction $\mu_{st,w}$ and 
the dynamic coefficient of friction $\mu_{dy,w}$. Furthermore, the model considers the 
shadow effect leading to a redistribution of the streamwise momentum toward the 
wall-normal momentum. Based on geometric considerations relying on generally 
used roughness parameters such as $R_z$ and $R_q$ the local inclination of the wall is 
determined. 
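
How a local wall inclination redistributes streamwise momentum into the wall-normal direction can be illustrated with a strongly simplified two-dimensional sketch. It assumes a frictionless rebound with a normal restitution coefficient only and takes the inclination angle `gamma` as a given input. The determination of the inclination from $R_z$ and $R_q$, the friction coefficients and the shadow effect are not reproduced, and the function name is hypothetical.

```python
import numpy as np


def rebound_inclined_wall(u, gamma, e_n):
    """Frictionless rebound at a wall element inclined by the angle gamma.

    u     : incoming velocity (streamwise x, wall-normal y); wall at y = 0
    gamma : local inclination in radians; gamma > 0 tilts the surface normal
            against the positive x-direction (element facing the particle)
    e_n   : wall-normal restitution coefficient
    """
    n = np.array([-np.sin(gamma), np.cos(gamma)])  # local unit normal
    u_n = np.dot(u, n)
    if u_n >= 0.0:
        return u.copy()                            # moving away: no impact
    return u - (1.0 + e_n) * u_n * n


# Illustrative shallow impact: mostly streamwise motion towards the wall.
u_in = np.array([1.0, -0.1])
print(rebound_inclined_wall(u_in, 0.0, 1.0))              # specular reflection
print(rebound_inclined_wall(u_in, np.radians(5.0), 1.0))  # inclined element
```

With `gamma = 0` and `e_n = 1` the specular reflection is recovered, whereas a positive inclination yields a larger wall-normal velocity after the impact at the expense of the streamwise component.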


4 HPC Strategies 

The most time-consuming computations carried out on the NEC SX-9 served 
to further improve the statistics of the flow in a model combustion chamber 
published in [1, 9, 10]. Concerning the HPC aspects, the optimization of the solver for 
the continuous phase in LESOCC was already reported in several previous reports, 
see, e.g., [13, 23]. Thus, here we restrict the discussion to the solver for the dispersed 
phase. 

In order to track a huge number of particles, it is important to work with 
efficient algorithms which are applicable on high-performance computers. The 
present scheme is highly efficient for the following reasons: 

• No CPU time-consuming search algorithm is needed in the present c-space 
scheme. 

• The particle properties are stored in linear arrays, which allows vectorization of all 
loops in the particle routines over the total number of particles on the processor. 
If the number of particles is reasonably large (e.g., >256) the loops are efficiently 
carried out on the vector unit. 

• Even if particles leave the present domain or deposit, the linear arrays 
are kept filled by reordering the particles after each time step. This guarantees 
optimal performance. 

• The multi-block exchange between blocks for the particle data completely relies 
on the same arrangement as used for the continuous phase. The data transfer itself 
is based on MPI. 

• Parallelization of the particle routines (and also of the flow solver) is achieved 
by domain decomposition, i.e., each processor deals with the particles of its own 
block. A minor disadvantage of this procedure is the fact that no load balancing 
of the particle tracking is possible, since the distribution of particles is not known 
in advance. However, since the tracking is so efficient that the predominant part 
of the CPU time is still spent on the continuous phase, this imbalance observed 
for the particle routines does not dominate the overall load balancing of 
the entire code. 

• The collision detection procedure is carried out over a small number of particles 
contained in a virtual cell, which reduces the computational cost from $O(N_p^2)$ 
to $O(N_p)$. 

• As described in detail in [10] several conditions are introduced to further reduce 
the number of potentially colliding particles step by step. At the end only very 
few potentially colliding particles remain, which have to be evaluated completely. 

• Vectorization of the most time-consuming loop of the collision check routine 
is achieved by splitting up the loop. Additionally, it is ensured that the loop is 
reasonably large to be efficiently carried out on vector units. 

For more details about the specific topic of particle collisions, we refer to 
the previous report [10] and [9]. 
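
The virtual-cell idea from the list above can be sketched as follows: particles are sorted into cells and candidate collision pairs are only formed among particles sharing a cell, which avoids testing all particle pairs. The uniform Cartesian cells, the cell size and the function name are assumptions for illustration; the additional exclusion conditions and the vectorized loop structure of the actual scheme are not reproduced.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def candidate_pairs(positions, cell_size):
    """Return index pairs of particles located in the same virtual cell."""
    positions = np.asarray(positions, dtype=float)
    cell_index = np.floor(positions / cell_size).astype(int)
    cells = defaultdict(list)
    for particle, key in enumerate(map(tuple, cell_index)):
        cells[key].append(particle)
    pairs = []
    for members in cells.values():
        # Only particles sharing a cell are checked against each other.
        pairs.extend(combinations(members, 2))
    return pairs


# Illustrative positions: particles 0 and 1 share a cell, particle 2 does not.
positions = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.5, 0.1, 0.1]]
print(candidate_pairs(positions, cell_size=1.0))
```

As long as the number of particles per cell stays bounded, the number of candidate pairs grows only linearly with the number of particles. Note that this simple variant ignores pairs separated by a cell face.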


5 Description of the Test Cases 
5.1 Horizontal Channel Flow 

Kussin and Sommerfeld [22], Kussin [21], and Sommerfeld and Kussin [35] carried 
out detailed measurements based on phase-Doppler anemometry in a particle-laden 
horizontal fully developed channel flow. Three different Reynolds numbers were 
studied experimentally. In the present study a value of $Re = \delta_{ch} U_b/\nu_f = 21{,}292$ 
was chosen ($Re_\tau = \delta_{ch} u_\tau/\nu_f = 946$). $\delta_{ch}$ denotes the channel half width, $U_b$ the 
bulk velocity and $u_\tau$ the friction velocity. The gravitational acceleration $g$ was 
made dimensionless by $U_b$ and $\delta_{ch}$ ($g^* = g\,\delta_{ch}/U_b^2 = -4.42 \times 10^{-4}$) and pointed 
towards the bottom wall, i.e., in negative y-direction. The particles were spherical 
glass beads ($\rho_p = 2{,}500$ kg/m³) with mean diameters in the range of 60 µm to 
1 mm subdivided into several classes of nominal size. In the present simulations 
mono-disperse particles with nominal diameters of $d_p = 60$, 100 and 195 µm 
were considered, representing a small size class and two medium size classes 
($St = \rho_p d_p^2 U_b/(18\,\mu_f\,\delta_{ch}\,\alpha) = 18.8$, 41.4 and 100.4). Exchangeable stainless steel 
plates allowed the effect of different degrees of wall roughness to be investigated. 
From all cases studied in [21, 22] the cases R0, R1 and R2 corresponding to a 
mean roughness of $R_z = 2.32$, 4.26 and 6.83 µm were investigated here for the 
particle size of $d_p = 195$ µm, allowing a detailed evaluation of the behavior of the 
wall roughness model at a mass loading of $\eta = 30\,\%$. 

Furthermore, the influence of inertia on the particle dynamics was analyzed by 
examining the behavior of particles with diameters of $d_p = 60$, 100 and 195 µm in 
the R2 channel configuration. The mass loading was set to $\eta = 10\,\%$ for the particles 
with a diameter of $d_p = 100$ and 195 µm. For the 60 µm particles a mass loading of 
$\eta = 12\,\%$ was considered, since no measurements were available for $\eta = 10\,\%$. 

The parameters modeling the velocity change during a flat wall impact were 
set to constant values for the cases chosen to evaluate the model: $e_{n,w} = 0.9$, 
$e_{t,w} = 0.3$, $\mu_{dy,w} = 0.4$, $\mu_{st,w} = 0.5$ and $C_{surface} = 3$. Note that for the channel flow 
only frictionless collisions between the particles were considered with $e_{n,p} = 0.97$ 
and the Magnus lift force was neglected since it was not implemented at this stage 
of the development of the code. 

The computational domain was $2\pi\delta_{ch} \times 2\delta_{ch} \times \pi\delta_{ch}$ in streamwise, 
wall-normal and spanwise direction, respectively. The grid employed had 
$128 \times 150 \times 150$ cells and the dimensionless time step was $\Delta t^* = 0.004$. In 
streamwise and spanwise direction an equidistant grid was used. In wall-normal 
direction the grid was stretched geometrically with a stretching factor $r = 1.06$ and 
the first cell center was located at $\Delta y^+ = 0.8$ for the unladen flow, which allows 
Stokes no-slip boundary conditions to be applied. Furthermore, periodic boundary conditions 
were applied in streamwise and spanwise direction. 


5.2 Vertical Pipe Flow 

Borée and Caraman [4] obtained two-component phase-Doppler anemometry 
measurements of a dilute poly-disperse two-phase flow at the exit of a long 
aluminum pipe. Unfortunately, no roughness specification was given for the pipe. 
The radius was $R_{pipe} = 10$ mm, the bulk velocity was $U_{pipe} = 3.4$ m/s, the centerline 
velocity was $U_c = 4$ m/s and the Reynolds number based on the bulk velocity and 
the pipe radius was $Re_{pipe} = 2{,}253$. The gravity pointed in flow direction. The 
diameters of the spherical glass beads initially ranged from $d_p = 37$ to 116 µm 
(see Table 1), and the beads had a density of $\rho_p = 2{,}470$ kg/m³. The values in Table 1 are 
taken from the discrete number distribution found in [4]. The bidisperse particle 
number distribution showed two peaks at $d_p \approx 60$ µm and $d_p \approx 90$ µm. Borée and 
Caraman [4] grouped particles with a diameter of 55 µm < $d_p$ < 65 µm into the 60 µm 
size class and particles with a diameter of 85 µm < $d_p$ < 95 µm into the 90 µm 
size class. The particle statistics in the experiment were evaluated for these two size 
classes. Two different mass loadings of $\eta = 11$ and 110 % were considered in the 
experiment. 

Table 1 Initial distribution of the particle sizes by Borée and Caraman [4] 

i                     1      2      3      4      5      6      7      8      9      10
d_{p,i} (µm)          36.9   42.7   55.8   61.1   67.0   74.1   87.8   97.1   107    116
Number distribution   0.019  0.044  0.120  0.107  0.077  0.093  0.219  0.186  0.102  0.033
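
The tabulated distribution can be checked and evaluated with a few lines of Python. This is only an illustrative sketch; the helper names are hypothetical and the grouping of classes is passed as an argument rather than prescribed.

```python
# Values copied from Table 1 (diameters in micrometers, number fractions).
d_p_um = [36.9, 42.7, 55.8, 61.1, 67.0, 74.1, 87.8, 97.1, 107.0, 116.0]
number_fraction = [0.019, 0.044, 0.120, 0.107, 0.077,
                   0.093, 0.219, 0.186, 0.102, 0.033]


def class_number_fraction(first, last):
    """Summed number fraction of the size classes first..last (1-based, inclusive)."""
    return sum(number_fraction[first - 1:last])


def number_mean_diameter():
    """Number-weighted mean particle diameter in micrometers."""
    weighted = sum(n * d for n, d in zip(number_fraction, d_p_um))
    return weighted / sum(number_fraction)


print(sum(number_fraction))         # sanity check of the tabulated fractions
print(class_number_fraction(3, 5))  # classes around the 60 micron peak
print(number_mean_diameter())
```

The tabulated fractions sum to 1.000, and the classes 3 to 5 around the 60 µm peak contain a number fraction of 0.304.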


In this case the parameters modeling the velocity and angular velocity changes 
during the wall impact were set to the following values: $e_{n,w} = 0.831$, $e_{t,w} = 0.31$, 
$\mu_{dy,w} = 0.125$ and $\mu_{st,w} = 0.6$. The parameters chosen for the hard-sphere collision 
model were set to $e_{n,p} = 0.97$, $e_{t,p} = 0.44$, $\mu_{dy,p} = 0.092$ and $\mu_{st,p} = 0.94$. These 
values were taken from Foerster et al. [17] for the pairings glass-glass (particle-particle 
collisions) and aluminum-glass (particle-wall collisions). To evaluate the 
roughness model two equivalent sandgrain roughnesses $k_s = C_{surface} R_z = 0.0$ and 
15.0 µm and additionally a specular wall collision (i.e., only the wall-normal 
velocity after the wall impact was changed with $e_{n,w} = 1$ and without friction during 
the collision) were chosen. Note that the chosen roughness height lies in the range 
given in the literature for aluminum pipes ($k_s = 1.5$ to 60 µm [19]). To evaluate the 
simulated particle statistics, the particles of the size classes $i = 3$ to 5 (see Table 1) 
were compared with the measurements of the 60 µm size class found in [4] and 
particles of the classes $i = 7$ to 8 were compared with the 90 µm size class of the 
measurements [4]. 

The computational domain had a dimensionless radius $R/R_{pipe} = 1$ and an 
extension of $L/R_{pipe} = 24$ in streamwise direction for the low mass loading case 
and $L/R_{pipe} = 12$ for the high mass loading. The longer computational domain for 
the low mass loading was chosen to ensure that the two-point correlations reached 
values close to zero at half of the pipe length. The computational domain was 
discretized by an O-type grid with $2 \times 10^6$ cells for the low mass loading and $1 \times 10^6$ 
for the high mass loading. The first cell center was placed at $y^+ = 0.14$ for the 
unladen flow and thus no-slip boundary conditions were applied. The flow is again 
periodic in streamwise direction. 


6 Results 

6.1 Horizontal Channel Flow 

In the following the behavior of the wall roughness model with increasing asperity 
heights is presented. For a detailed explanation of the governing mechanism in a 
particle-laden flow confined by rough walls we refer to Breuer et al. [11]. Here only 
a brief summary of the observations is provided. 

The results obtained for R0, R1 and R2 are compared with the experiments 
[21]. Furthermore, the special case of a smooth wall is included. Figure 1a 
shows the mean streamwise particle velocity which is in good agreement with the 
measurements [21]. It is clearly evident that the mean particle velocity decreases 
with increasing roughness. 

Fig. 1 Variation of the wall roughness: statistical results of the plane channel flow for the particles 
$d_p = 195$ µm, $\eta = 30\,\%$, $e_{n,w} = 0.9$, $\mu_{dy,w} = 0.4$: (a) mean velocity, (b) streamwise r.m.s. 
fluctuations, (c) wall-normal r.m.s. fluctuations, (d) normalized particle number concentration 

Figure 1b shows the streamwise particle velocity fluctuations. The r.m.s. fluctuations 
increase with increasing wall roughness. The simulated particle wall-normal 
velocity fluctuations illustrated in Fig. 1c show an almost linear increase with 
$R_z$. It is especially remarkable how the normalized particle number concentration 
(Fig. 1d) is influenced by the roughness: While a smooth channel wall leads to an 
accumulation of the particles at the bottom wall, the computed concentration for the 
roughness R2 delivers an almost uniform profile. Furthermore, there are consistent 
changes in the concentration profiles if the presented wall model is applied to 
the R0 roughness rather than assuming a smooth wall. Good agreement with the 
experiments is observed except in the near-wall regions where the computed profiles 
show a strong peak at the bottom wall. A possible explanation of this observation is 
that especially for particles reaching the wall with a small incident angle, the present 
wall model could lead to a non-zero probability of the rebounded particles to remain 
grazing, i.e., to rebound with a very small or zero wall-normal velocity (see Konan 
et al. [20]). Such particles may experience a multiple rebound with the rough wall 
[20] presently not included in the wall model. 

Fig. 2 Variation of $d_p$: statistical results of the plane channel flow for the particles at $\eta \approx 10\,\%$, 
$e_{n,w} = 0.9$, $\mu_{dy,w} = 0.4$, roughness R2: (a) mean velocity, (b) streamwise r.m.s. fluctuations, 
(c) wall-normal r.m.s. fluctuations, (d) normalized particle number concentration 


Figure 2 depicts the particle statistics obtained for the R2 channel configuration 
varying the particle diameter $d_p$. The mass loading was set to $\eta = 10\,\%$ ($d_p = 100$ 
and 195 µm) or $\eta = 12\,\%$ ($d_p = 60$ µm). The influence of inertia is clearly visible 
considering the mean particle velocity in Fig. 2a: Even though the 60 µm particles 
are much more strongly influenced by the roughness structures in the near-wall region, 
they adjust much more quickly to the fluid flow, leading to a higher mean velocity 
compared with the 195 µm particles (see also [36]). Good agreement with the experiments 
of Kussin [21] is found for all cases. Figure 2b shows the particle streamwise 
velocity fluctuations which also reasonably agree with the experiments [21]. The 
wall-normal velocity fluctuations (Fig. 2c) computed for the small 60 µm particles 
are in close agreement with the experiments, whereas for the larger particles the 
fluctuations are underpredicted. Similar to the wall-normal velocity fluctuations no 
specific trend can be observed for the normalized particle concentration (Fig. 2d). 

6.2 Vertical Pipe Flow 

In the following, results of a turbulent particle-laden pipe flow adopting the model 
presented in Breuer et al. [11] are compared with the experimental data of Borée 
and Caraman [4]. Additionally, the DNS data obtained by Vreman [38] in the same 
pipe are included for comparison. For the low mass loading Vreman [38] presented 
only simulations with mono-disperse 60 µm particles and thus only the unladen 
flow results were taken for comparison. For the high mass loading $\eta = 110\,\%$ 
Vreman [38] carried out mono-disperse simulations with 90 µm particles. The 
results achieved are not expected to differ substantially from the poly-disperse 
scenario (87 % of the particle mass belongs to the 90 µm size class [4]) and hence 
can be compared with the present simulation. Note that the data of Vreman [38] 
were obtained with a simpler model for the influence of rough walls on the particle 
rebound. Unfortunately, Borée and Caraman [4] did not provide fluid measurements 
for the high mass loading and hence these could not be included. 

Figure 3 shows the mean streamwise particle ($d_p = 60$ and 90 µm) and fluid 
velocity for the two mass loadings investigated for different wall roughnesses. 
The following observations can be made: (i) Adopting the sandgrain wall roughness 
model leads to a reduction of the mean particle velocity in the pipe center and a 
slight increase in the near-wall region. (ii) The 90 µm particles (see Fig. 3b, d) are 
more strongly influenced by the wall roughness than the 60 µm particles (see Fig. 3a, c). 
(iii) Deviations from the experiments can be seen for all particle statistics in the 
near-wall region (see Fig. 3a-d). (iv) The mean fluid flow is only affected in the high 
mass loading case, showing, similar to the DNS of Vreman [38], a reduction of the 
mean fluid velocity in the pipe center. The wall roughness leads to an increase of 
the particle concentration (not shown for the sake of brevity) in the pipe center and 
hence, due to the four-way coupling, to a force on the fluid acting against the flow 
direction. (Note that the particles in the pipe center are slower than the fluid.) (v) By 
adopting the sandgrain roughness model good accordance with the experiments [4] 
can be found except in the near-wall region. 

Figure 4 shows the streamwise particle ($d_p = 60$ and 90 µm) and fluid velocity 
fluctuations for the two mass loadings investigated for different wall roughnesses. 
It is obvious that for all particle statistics an improvement (Fig. 4a-d) can be seen 
if the roughness model [11] is adopted. For the low mass loading (Fig. 4a, b) the 
model leads to an increase of the r.m.s. fluctuations in the pipe center, whereas 
they remain nearly unchanged close to the wall. Concerning the high mass loading 
(Fig. 4c, d) the model yields an increase of the r.m.s. fluctuations for the 60 µm 
particles, whereas for the r.m.s. fluctuations of the 90 µm particles only small 
changes throughout the pipe can be reported. Remarkable is the strong reduction of 
the fluid velocity fluctuations from the low mass loading (Fig. 4e) to the high mass 
loading (Fig. 4f). Note that in his DNS Vreman [38] made the same observation. 

Figure 5 shows the wall-normal particle ($d_p = 60$ and 90 µm) and fluid velocity 
fluctuations for the two mass loadings investigated for different wall roughnesses. 
As for the other statistics shown, the sandgrain roughness model leads to a 
considerably better agreement with the experiments than the specular wall or 
the smooth wall boundary conditions. Remarkable is the strong enhancement of 
the wall-normal particle velocity fluctuations for both mass loadings and particle 
classes evaluated (Fig. 5a-d). This indicates that by means of the wall roughness the 
particles achieve a steeper trajectory in a similar manner as the particles rebounding 
at the rough walls in the channel flow (see Fig. 1c and also [11]). Analogous to 
the streamwise fluid velocity fluctuations, a strong reduction of the fluid velocity 
fluctuations from the low mass loading (Fig. 5e) to the high mass loading (Fig. 5f) 
can also be observed in the wall-normal direction. Note that the same observation was 
again made by Vreman [38]. 

Fig. 3 Variation of the wall roughness: statistical results of the poly-disperse pipe flow; (a)-(d) 
mean particle velocity for two particle diameters and mass loadings; (e)-(f) mean fluid velocity 

Fig. 4 Variation of the wall roughness: statistical results of the poly-disperse pipe flow; 
(a)-(d) particle streamwise r.m.s. fluctuations for two particle diameters and mass loadings; (e)-(f) 
streamwise fluid r.m.s. fluctuations 
Fig. 5 Variation of the wall roughness: statistical results of the poly-disperse pipe flow; (a)-(d) 
particle wall-normal r.m.s. fluctuations for two particle diameters and mass loadings; (e)-(f) fluid 
wall-normal r.m.s. fluctuations. Curves for the fluid: specular wall, $k_s = 0$ µm, $k_s = 15$ µm 
and Vreman [38] ($\eta = 110\,\%$) 


7 Conclusions 


In view of the results obtained in Sect. 6 some analogies between the behavior of 
solid particles bounded by rough walls in a turbulent channel and pipe flow can be 
found: 

• For both configurations the wall roughness leads to a reduction of the mean 
particle velocity with increasing roughness height, which implies an additional 
pressure drop due to the four-way coupling. 

• An enhancement of the particle wall-normal fluctuations could be found, indicating 
steeper particle trajectories and an additional momentum transfer between 
the streamwise and the wall-normal direction. 

• Both effects were more pronounced for bigger particles than for smaller particles 
since big particles keep memory of the wall impact for a longer time than small 
particles. 

• An improvement of the particle statistics in both configurations can be achieved 
by considering the rebound behavior of solid particles at rough walls applying 
the recently developed sandgrain roughness model. 

Regarding the particle concentration, differences can be found between the 
horizontal channel and the vertical pipe. In a channel flow the wall roughness leads 
to a homogenization of the particle concentration. (Note that gravity pointed towards 
the bottom wall.) In the pipe flow the wall roughness induces an accumulation of 
the particles in the pipe center (not shown for the sake of brevity) leading for the 
high mass loading to a reduction of the mean fluid velocity in this region by means 
of the four-way coupling. 

In conclusion, with the new findings it is possible to substantially improve the 
inflow conditions of the combustion chamber flow described in the former report of 
Breuer and Alletto [10] and in [1,9]. 

Additionally, for the chamber flow a new grid with a finer resolution is presently 
used to better resolve the complicated flow in the two recirculation regions 
extending behind the chamber entrance. Both measures are expected to considerably 
improve the results of the combustion chamber flow. 

Large-Eddy Simulation of Supersonic Film 
Cooling at Incident Shock-Wave Interaction 


Martin Konopka, Matthias Meinke, and Wolfgang Schröder 


Abstract The impact of shock waves on supersonic cooling films is studied using 
large-eddy simulations (LES). A laminar cooling film is injected through a slot at 
a Mach number $Ma_i = 1.8$ into a fully turbulent boundary layer at a freestream 
Mach number $Ma_\infty = 2.44$. The cooling film is disturbed by oblique shock waves 
at deflection angles of 5° and 8° at two positions downstream of the slot. At 
shock impingement close to the slot, i.e., within the potential-core region, at a flow 
deflection of 5°, a cooling effectiveness decrease of 33 % occurs downstream of 
the separation bubble compared to a configuration without shock impingement. 
If the same shock impinges further downstream upon the boundary-layer region, 
the decrease is only 17 %. The stronger 8° shock wave at the further downstream 
impingement position leads to a maximum decrease of 33 %. The current report 
presents a concise version of Konopka et al. (4th European conference for aerospace 
sciences, 2011). 


1 Introduction 

In supersonic combustion ramjets (Scramjets) shock waves are present in the 
isolator and combustion chamber. These shock waves interact with cooling films 
if supersonic film cooling is used to protect the engine’s interior surfaces from 
the intense aerodynamic heating and hot combustion products. This film cooling 
problem, i.e., the interaction of shock waves with a supersonic cooling film injected 
through a slot, was investigated experimentally for turbulent flows [2-4] to assess 
the impact on heat transfer and cooling effectiveness. More recent experimental 
studies performed by Kanda and Ono [5] and Kanda et al. [6] on film cooling 
with shock wave interaction found that shock waves have only little effect on 
cooling effectiveness. 

Fig. 1 Main flow features of shock-wave impingement upon the potential-core region (case II) 

Kanda and Ono and Kanda et al. showed that the cooling 
effectiveness is mainly reduced by the increased wall-recovery temperature which 
is caused by the reduced local Mach number downstream of the shock wave. 
However, Peng and Jiang [7] found in their computational study, which used the 
Reynolds-averaged Navier-Stokes (RANS) equations with the Menter SST [8] turbulence 
model, that the mole fraction of the injected gas decreased at the impingement 
position of the shock waves. This indicates that increased mixing due to the elevated 
turbulence levels in the cooling flow occurred in addition to the effect of the 
reduction of the local Mach number. Therefore, the following study focuses on 
whether the reduction of cooling effectiveness by shock waves is only a function 
of the local Mach number or whether increased turbulence also plays a role. Four cooling 
configurations at the same injection condition are investigated. A zero pressure 
gradient configuration (case I) is compared to experiments of Juhany et al. [9]. 
Two configurations in which a shock wave generated by a flow deflection angle of 
5° impinges at different positions are analyzed. The first impingement 
position is located within the potential-core region [10, 11], where the injected 
cooling flow is laminar (case II). The second position is located further downstream 
in the boundary-layer region [9, 10] (case III), where the cooling flow mixes with the 
freestream and boundary-layer-like velocity profiles occur. Additionally, a stronger 
shock wave generated by an 8° flow deflection impinges at the same downstream 
position within the boundary-layer region upon the cooling flow (case IV). 

The principal flow features of shock-cooling-film interactions within the 
potential-core region (case II) are sketched in Fig. 1. The potential-core region, 
which originates at the slot, is encompassed by the laminar slot boundary layer and 
the mixing layer where the freestream mixes with the cooling flow. On top of the 
mixing layer there is the shear layer which emanates from the lip and is fed by the 
turbulent boundary layer. The incident oblique shock wave crosses these layers and 
causes the laminar separation bubble. Slightly upstream of the separation bubble, 
compression waves decelerate the flow. At the crest of the bubble an expansion fan 
emerges, and then the flow reattaches, creating compression waves which coalesce and 
form a shock wave. At the end of the separation bubble the laminar slot boundary 
layer undergoes transition. 

The report is organized as follows. First, the numerical method is presented; 
subsequently, details of the boundary conditions and the computational mesh are 
given. Next, the flow configuration is explained and the results section follows. 
In the results section, the zero pressure gradient film cooling configuration is 
validated against experiments by Juhany et al. [9]; then, shock-cooling-film 
interactions are investigated in terms of cooling effectiveness, instantaneous flow 
properties, and turbulence statistics. Subsequently, some details on the consumed 
computational resources are given. Finally, some conclusions are drawn. 


2 Numerical Method 

In the past, the RANS equations, closed among other approaches by $k$-$\varepsilon$ turbulence 
models, have been used [12-14] to model film cooling problems with varying success, i.e., 
depending on the variant of the model, there was quite a discrepancy in the wall 
temperature distributions. This is caused in part by the modeling of the mixing layer 
between the cooling flow and the freestream, for which no satisfactory model that accounts 
for density gradients exists [15]. Therefore, in this study high-fidelity turbulence 
modeling is applied, i.e., large-eddy simulations (LES) are performed. 

The Navier-Stokes equations are discretized at second-order accuracy using a 
modified mixed-centered upwind advective upstream splitting method (AUSM) [16] 
for the Euler terms. The discretization of the non-Euler terms is done using a 
centered approximation at second-order accuracy. The temporal integration is done 
by a second-order five-stage low-storage Runge-Kutta method. The non-resolved 
subgrid scales are implicitly modeled using the MILES ansatz [17]. The viscosity 
is evaluated by a power law $\mu/\mu_0 = (T/T_0)^{0.72}$, where $T_0$ denotes the stagnation 
temperature. A detailed summary of the flow solver used in this study is given by 
Meinke et al. [18]. The accuracy of its solutions in fully turbulent flows is discussed 
in [19-21]. The solution algorithm has also shown convincing results in supersonic 
flows involving shock-boundary-layer interactions [22]. 
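
The power-law viscosity stated above can be written as a one-line function. The following is a minimal sketch, not the solver's implementation; the function name and argument names are illustrative, and only the exponent 0.72 and the use of the stagnation temperature as reference are taken from the text.

```python
def viscosity_ratio(temperature, reference_temperature, exponent=0.72):
    """Return mu/mu_0 from the power law mu/mu_0 = (T/T_0)**exponent.

    Both temperatures must be absolute temperatures in the same unit.
    The default exponent 0.72 is the value quoted in the text.
    """
    if temperature <= 0.0 or reference_temperature <= 0.0:
        raise ValueError("temperatures must be positive (absolute scale)")
    return (temperature / reference_temperature) ** exponent


if __name__ == "__main__":
    # Viscosity of a fluid element at half the stagnation temperature.
    print(viscosity_ratio(150.0, 300.0))
```

Because only the temperature ratio enters, the function works for any consistent absolute temperature unit.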


3 Boundary Conditions and Computational Mesh 

The prescription of realistic inflow variables for compressible turbulent boundary 
layers is a challenge in LES since at every time step a different instantaneous 
inflow distribution is required. This problem can be solved by computing the 
compressible turbulent boundary layer from the leading edge of a flat plate. To avoid 
this computationally costly approach an independent boundary layer simulation is 
performed using the rescaling method proposed by El-Askary et al. [22], which is 
based on Lund et al.’s [23] approach considering compressibility. In the boundary- 
layer domain, which is depicted in Fig. 2, the inflow distribution is generated by 
rescaling the flow variables obtained from a plane within the domain such that a 
constant boundary layer thickness at the inflow is achieved. A slice of the flow field 
is then extracted at every time step and injected into the main film cooling domain. 

Fig. 2 Sketch of the boundary conditions and the turbulent inflow data generation method 

At the lower wall of the film-cooling simulation domain adiabatic no-slip 
boundary conditions are imposed; at the exit, all variables are extrapolated. To avoid 
any spurious oscillations, a sponge layer is used at the exit and at the top of the 
boundary-layer domain, where the flow variables are driven to the desired target 
values [22]. At the slot a laminar supersonic cooling flow is prescribed. In the 
spanwise direction, periodic boundary conditions are used. The desired shock-wave 
strength and angle are generated at the upper boundary by setting flow conditions 
which satisfy the Rankine-Hugoniot relations. 
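
To illustrate what satisfying the Rankine-Hugoniot relations implies for a given deflection angle, the sketch below evaluates the standard oblique-shock relations. It is not the boundary-condition routine of the flow solver. It assumes a calorically perfect gas with a ratio of specific heats of 1.4 and the weak shock solution; all function names are hypothetical. Applying the same deflection twice (incident shock followed by a regular inviscid reflection, which is an assumption made here) gives Mach numbers close to the $Ma_3$ column of Table 1.

```python
import math


def deflection_angle(mach, beta, gamma=1.4):
    """Flow deflection (rad) across an oblique shock with wave angle beta (rad)."""
    numerator = 2.0 / math.tan(beta) * (mach**2 * math.sin(beta) ** 2 - 1.0)
    denominator = mach**2 * (gamma + math.cos(2.0 * beta)) + 2.0
    return math.atan(numerator / denominator)


def weak_shock_angle(mach, theta, gamma=1.4, step=math.radians(0.05)):
    """Weak-solution wave angle (rad) for a deflection theta (rad)."""
    lower = math.asin(1.0 / mach)  # Mach angle, i.e. zero deflection
    # March upward from the Mach angle until the target deflection is bracketed.
    while deflection_angle(mach, lower + step, gamma) < theta:
        lower += step
        if lower + step >= 0.5 * math.pi:
            raise ValueError("no attached shock exists for this deflection")
    upper = lower + step
    for _ in range(60):  # bisection inside the bracket
        middle = 0.5 * (lower + upper)
        if deflection_angle(mach, middle, gamma) < theta:
            lower = middle
        else:
            upper = middle
    return 0.5 * (lower + upper)


def post_shock_state(mach, theta_deg, gamma=1.4):
    """Return (downstream Mach number, static pressure ratio, wave angle in deg)."""
    theta = math.radians(theta_deg)
    beta = weak_shock_angle(mach, theta, gamma)
    mn1 = mach * math.sin(beta)  # shock-normal upstream Mach number
    mn2 = math.sqrt((1.0 + 0.5 * (gamma - 1.0) * mn1**2)
                    / (gamma * mn1**2 - 0.5 * (gamma - 1.0)))
    mach2 = mn2 / math.sin(beta - theta)
    pressure_ratio = 1.0 + 2.0 * gamma / (gamma + 1.0) * (mn1**2 - 1.0)
    return mach2, pressure_ratio, math.degrees(beta)


def mach_after_reflection(mach, theta_deg, gamma=1.4):
    """Mach number behind incident plus reflected shock of equal deflection."""
    mach2, _, _ = post_shock_state(mach, theta_deg, gamma)
    mach3, _, _ = post_shock_state(mach2, theta_deg, gamma)
    return mach3


if __name__ == "__main__":
    for angle in (5.0, 8.0):
        print(angle, post_shock_state(2.44, angle), mach_after_reflection(2.44, angle))
```

The bracketing march is used instead of a closed-form solution so that the weak branch is selected without further case distinctions.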

The body-fitted computational mesh consists of 17.1 million grid points with an 
equidistant spacing in the streamwise and spanwise directions. The resolution at the 
wall in inner coordinates is $\Delta x^+ = 20$, $\Delta y^+ = 0.5$, $\Delta z^+ = 10$ in the streamwise, 
wall-normal, and spanwise directions. A grid study for this cooling configuration 
was performed by Konopka et al. [24] where the current resolution was found 
to be adequate. Due to the spanwise domain size of z/S = 2.2 the computations 
do not resolve some large-scale vortices. A similar cooling configuration with a 
lower injection Mach number of $Ma_i = 1.2$ was investigated by Konopka et al. [25] 
and it was found that the spanwise domain size of the present study leads to 
an overestimation of the cooling effectiveness by about 9 % at a zero pressure 
gradient. The overestimation is similar at shock interaction. Therefore, reasonable 
conclusions can be drawn from the current study at the present spanwise domain 
extent. 


4 Flow Configuration 

The freestream and injection flow properties match those used in the experiment 
by Juhany et al. [11]. The freestream Mach number is set to $Ma_\infty = 2.44$ and 
the freestream Reynolds number $Re_\infty = u_\infty S/\nu_\infty$, based on the slot height $S$, 
the freestream velocity $u_\infty$, and the freestream kinematic viscosity $\nu_\infty$, is 
$Re_\infty = 13{,}500$. The freestream boundary-layer thickness at the tip of the lip is 
$\delta/S = 2.2$. The blowing rate $M$, the injection Mach number $Ma_i$, and the total 
temperature ratio $T_{ti}/T_{t\infty}$ of the cooling flow are listed in Table 1 and are kept 
constant for all cases considered. 

Table 1 Flow configuration 

Case  $Ma_i$  $T_{ti}/T_{t\infty}$  $M = \rho_i u_i/(\rho_\infty u_\infty)$  $\beta$  $Ma_3$  $x_{imp}/S$  $L_{in}/S$  $\Delta t \cdot 0.03\,u_\infty/L_{in}$ 
I     1.8     0.76                  0.74                                     0°       2.44    -            -           - 
II    1.8     0.76                  0.74                                     5°       2.04    17           7.2         1.04 
III   1.8     0.76                  0.74                                     5°       2.04    60           3.2         2.58 
IV    1.8     0.76                  0.74                                     8°       1.82    60           5.8         1.29 

Case I is a zero pressure gradient configuration and 
is validated in Sect. 5.2.3 against the experiments performed by Juhany et al. [11]. The 
shock waves caused by a flow deflection angle of $\beta = 5^\circ$ impinge, assuming inviscid 
flow, upon the potential-core region at a downstream distance x/S = 17 (case II) and 
upon the boundary-layer region at x/S = 60 (case III). Additionally, a stronger shock 
wave is considered, impinging at the same downstream distance x/S = 60 upon 
the cooling flow (case IV). At the cooling slot, the thickness of the 
upper and lower laminar supersonic boundary layers is assumed to be $\delta/S = 0.07$. The 
pressure is set to the freestream pressure $p_\infty$. 

The duration $\Delta t$ of the simulations which is used for the averaging process is 
given in Table 1. The time interval $\Delta t$ is normalized by the timescale $L_{in}/(0.03\,u_\infty)$, 
where $L_{in}$ is the interaction length, i.e., the distance between the point of separation 
and the location where the shock impinges upon the wall assuming inviscid 
conditions. This time scale is associated with the low frequency motion of the 
shock [26]. The current time averaging window covers at least one complete cycle 
of the shock. 
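
The normalized averaging windows of Table 1 can be converted into convective time units $S/u_\infty$ with a short calculation. The sketch below assumes that the interaction length in Table 1 is given in units of the slot height $S$, which is an interpretation of the garbled table header rather than something stated explicitly; the function name is illustrative.

```python
def window_in_convective_units(normalized_window, interaction_length_over_s,
                               frequency_factor=0.03):
    """Convert dt * 0.03 * u_inf / L_in into dt / (S / u_inf).

    normalized_window        : last column of Table 1
    interaction_length_over_s: interaction length L_in / S (assumed unit)
    frequency_factor         : the factor 0.03 of the shock time scale
    """
    return normalized_window * interaction_length_over_s / frequency_factor


if __name__ == "__main__":
    cases = {"II": (1.04, 7.2), "III": (2.58, 3.2), "IV": (1.29, 5.8)}
    for name, (window, length) in cases.items():
        print(name, window_in_convective_units(window, length))
```

Under this assumption the three shock cases are averaged over a similar physical duration of roughly 250 to 275 convective time units.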


5 Results 

The results section is divided into two major parts, first, the zero pressure gradient 
film cooling configuration is validated and the length of the potential-core region 
is determined. Then, shock-cooling-film interactions are investigated in terms of 
instantaneous and mean flow properties, cooling effectiveness distributions, and 
turbulence statistics. 


5.1 Zero-Pressure Gradient Configuration (Case I) 

The zero-pressure gradient configuration (case I) is validated in Sect. 5.2.3 by 
comparing numerically obtained cooling effectiveness distributions with the 
experiments by Juhany et al. [11]. 

Fig. 3 Streamwise velocity profiles and Reynolds shear stress component $u''v''/u_\infty^2$ for case I; grid spacing is $u/u_\infty = 1$ and $u''v''/u_\infty^2 = 0.0035$. (a) x/S = 0, 15, 30, 40. (b) x/S = 0, 15, 30, 40 

To determine the length of the potential-core region for the choice of the 
shock-impingement positions at cases II-IV, streamwise velocity profiles of case I are 
shown in Fig. 3a. At x/S = 0, the potential-core region is visible by the constant 
velocity in the slot flow at -1.16 < y/S < -0.16. With increasing downstream 
distance, this region of constant streamwise velocity shrinks since the laminar slot 
boundary layer at the wall and the mixing layer, which emanates from the lip, 
grow and eventually merge. This becomes apparent at the downstream distance 
x/S = 40, where the velocity profile already resembles that of an undisturbed fully 
turbulent boundary layer. The Reynolds shear stress component $u''v''/u_\infty^2$, which is 
composed of the fluctuations of the streamwise and wall-normal velocity 
components about the Favre-averaged mean, is shown in Fig. 3b. The potential-core 
region is evident by the zero shear stress at x/S = 0, -1.16 < y/S < -0.16. 
At x/S = 30, the laminar slot boundary layer has undergone transition, which is 
indicated by the negative peak shear stress close to the wall. However, a small area 
with minimal shear stress at x/S = 30, y/S = -0.4 is still visible, indicating 
a potential flow. At x/S = 40, the slot boundary layer and the mixing layer have 
merged, marking the end of the potential-core region and the beginning of the 
boundary-layer region. 
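
The shear stress component discussed above can be evaluated from a time series of samples at one point. The exact averaging operator is not spelled out in the text, so the following sketch assumes density-weighted (Favre) averaging for both the mean velocities and the stress itself; function and variable names are illustrative.

```python
def favre_mean(density, quantity):
    """Density-weighted mean of a sampled quantity."""
    weighted = sum(r * q for r, q in zip(density, quantity))
    return weighted / sum(density)


def reynolds_shear_stress(density, u, v, u_inf):
    """Favre-averaged u''v'' normalized by the squared freestream velocity.

    density, u, v : equally long sequences of samples at one point
    u_inf         : freestream velocity used for normalization
    """
    u_tilde = favre_mean(density, u)
    v_tilde = favre_mean(density, v)
    # Fluctuations are taken about the Favre-averaged means.
    products = [(ui - u_tilde) * (vi - v_tilde) for ui, vi in zip(u, v)]
    return favre_mean(density, products) / u_inf**2


if __name__ == "__main__":
    # Synthetic samples: u and v fluctuate in opposite directions,
    # which yields a negative shear stress as found near the wall.
    print(reynolds_shear_stress([1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], 2.0))
```

In a potential flow without fluctuations the product vanishes, which is the criterion used in the text to identify the potential-core region.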


5.2 Analysis of Supersonic Film Cooling with Shock Waves 

In the following, cases II-IV are considered, i.e., shock-cooling interactions are 
investigated to show the impact on cooling effectiveness and turbulence statistics. 

Fig. 4 Skin friction coefficient and wall pressure distribution plotted vs. the streamwise distance from the slot. (a) Skin friction coefficient. (b) Wall pressure 


5.2.1 Flow Characteristics of Supersonic Film Cooling with Shock Waves 

Figure 4a shows the skin-friction coefficient distribution vs. streamwise distance 
of all considered cases to clearly identify regions with separated flow. At the 
zero-pressure gradient cooling configuration, the skin friction rises quickly at 
the end of the potential-core region. At the shock-impingement position within the 
potential-core region at case II, a separation bubble with a length of $L_{sep}/S = 6.5$ exists. 
Downstream of the laminar separation bubble the skin-friction coefficient shows a 
pronounced peak, which is a clear sign of the transition of the laminar slot boundary 
layer. At the shock impingement upon the boundary-layer region at case III, a much 
smaller separation bubble is found with a length of $L_{sep}/S = 0.9$, since the 
wall-bounded flow is turbulent at this downstream position. The greater shock strength at 
case IV due to the higher deflection angle $\beta = 8^\circ$ leads to a slightly larger separation 
bubble with $L_{sep}/S = 3.6$, but the minimum skin-friction coefficient has the same level as at 
case III. 
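
Separated regions are identified above by the sign of the skin-friction coefficient. One possible way to extract separation and reattachment points from a discrete distribution is sketched below; it is not taken from the authors' post-processing. The function name is hypothetical and the sample data are synthetic.

```python
def separation_bubbles(x, cf):
    """Return a list of (x_separation, x_reattachment) pairs.

    Sign changes of the skin-friction coefficient cf(x) are located by
    linear interpolation between neighboring grid points.
    """
    bubbles = []
    x_separation = None
    for i in range(len(x) - 1):
        a, b = cf[i], cf[i + 1]
        if a >= 0.0 and b < 0.0:
            # Skin friction turns negative: separation point.
            x_separation = x[i] + (x[i + 1] - x[i]) * a / (a - b)
        elif a < 0.0 and b >= 0.0 and x_separation is not None:
            # Skin friction turns positive again: reattachment point.
            x_reattachment = x[i] + (x[i + 1] - x[i]) * a / (a - b)
            bubbles.append((x_separation, x_reattachment))
            x_separation = None
    return bubbles


if __name__ == "__main__":
    x_over_s = [0.0, 1.0, 2.0, 3.0, 4.0]
    cf = [1.0e-3, -1.0e-3, -1.0e-3, -1.0e-3, 1.0e-3]  # synthetic values
    for x_s, x_r in separation_bubbles(x_over_s, cf):
        print("separation", x_s, "reattachment", x_r, "length", x_r - x_s)
```

For case II the text quotes a separation point at x/S = 9.8 and a time-averaged reattachment point at x/S = 16.3, whose difference is the bubble length of 6.5 given above.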

The wall pressure distributions for cases II-IV are juxtaposed in Fig. 4b. The 
large laminar separation bubble at case II is visible by the plateau in the pressure 
distribution at x/S = 15. Such a plateau is not observed at the shock impingement 
in the boundary-layer region at both shock strengths. 


Fig. 5 Turbulent structures visualized by the Q criterion with mapped-on Mach number contours; shock- and expansion-fan contours at $\partial u_j/\partial x_j \cdot S/u_\infty = -0.2$, case II 

Fig. 6 Instantaneous numerical Schlieren image at the centerline of the computational domain at case II; (a) shear layer, (b) mixing layer, (c) potential-core region, (d) laminar slot boundary layer, (e) laminar separation bubble, (f) incident shock wave, (g) reflected shock wave 


5.2.2 Instantaneous Flow Field 

Figure 5 shows the vortical structures visualized by the Q criterion [27] with 
mapped-on Mach number contours for the shock impinging upon the potential- 
core region (case II). Shock waves are indicated by gray contours of the velocity 
divergence $\partial u_j/\partial x_j \cdot S/u_\infty = -0.2$. The incoming turbulent boundary layer above 
the slot is disturbed by an expansion fan which is followed by a shock wave. In 
the injected cooling film no vortices are present. The upper border of this region is 
defined by the mixing and shear layers which emanate from the lip. The incident 
oblique shock wave penetrates through the shear and mixing layers. Downstream 
of the reflected shock vortices are detected in the cooling flow terminating the 
potential-core region. 
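
Both quantities used for the visualization above follow from the local velocity gradient tensor. The sketch below uses the common definition of the Q criterion as half the difference between the squared norms of the rotation and strain-rate tensors; whether reference [27] is applied in exactly this form to the compressible flow field is an assumption. Function names are illustrative.

```python
def velocity_divergence(grad):
    """Trace of the 3x3 velocity gradient tensor grad[i][j] = du_i/dx_j."""
    return grad[0][0] + grad[1][1] + grad[2][2]


def q_criterion(grad):
    """Q = 0.5 * (|Omega|^2 - |S|^2) with Frobenius norms.

    Positive values mark regions where rotation dominates strain,
    i.e. vortical structures.
    """
    q = 0.0
    for i in range(3):
        for j in range(3):
            strain = 0.5 * (grad[i][j] + grad[j][i])
            rotation = 0.5 * (grad[i][j] - grad[j][i])
            q += 0.5 * (rotation**2 - strain**2)
    return q


if __name__ == "__main__":
    rigid_rotation = [[0.0, -2.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    compression = [[-0.2, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(q_criterion(rigid_rotation), velocity_divergence(rigid_rotation))
    print(q_criterion(compression), velocity_divergence(compression))
```

A negative divergence marks compression, which is why an isocontour of the normalized divergence at -0.2 picks out the shock waves.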

These flow features at the shock-impingement position within the potential-core 
region (case II) are clearly visible in the instantaneous numerical Schlieren image 
shown in Fig. 6. The potential-core region (c), which is encompassed by the laminar 
slot boundary layer (d) and the mixing layer (b), extends even downstream of the 
shock impingement position at this time level to x/S = 14.5. Further downstream 
at x/S = 17, disturbances appear in the potential-core flow, which is near the 
time-averaged reattachment point $x_r/S = 16.3$ of the separation bubble (e). 

Fig. 7 Cooling effectiveness and wall temperature vs. streamwise distance for all cases. (a) Cooling effectiveness. (b) Wall temperature 


5.2.3 Cooling Effectiveness 


The temporal and spanwise averaged cooling effectiveness definition reads 


$$\eta = \frac{T_{aw} - T_{r\infty}}{T_{ri} - T_{r\infty}} \qquad (1)$$ 


where the quantity $T_{aw}$ denotes the Favre-averaged adiabatic wall temperature, $T_{r\infty}$ 
the recovery temperature of the freestream, and $T_{ri}$ the recovery temperature of 
the cooling flow. The cooling effectiveness distribution in Fig. 7a for case I shows 
a good agreement with the experiments of Juhany et al. [11]. An in-depth analysis 
of this cooling configuration can be found in [24]. Konopka et al. [25] investigated 
a similar cooling configuration with a lower injection Mach number of $Ma_i = 1.2$. 
Note that Konopka et al. found that the numerical cooling effectiveness is 9 % higher 
when using a spanwise domain size of z/D = 2.2 compared to z/D = 4.4 since 
some large-scale turbulence is not captured. However, the conclusions drawn in this 
study are still reasonable since the magnitude of the overestimation is the same 
for cases I-IV. In Fig. 7a it is evident that at the shock impingement upon the 
potential-core region (case II), the cooling effectiveness is reduced beginning at the 
separation point of the laminar slot boundary layer at $x_s/S = 9.8$. Downstream 
of the separation bubble, where the laminar slot boundary layer undergoes 
transition, the cooling effectiveness decreases further. The turbulent slot boundary 
layer immediately mixes with the mixing layer emanating from the lip which 
extends at this downstream position towards the wall. Therefore, it is clear that 
at case II, besides the increase of the recovery temperature at the wall caused by the 
reduction of the local Mach number downstream of the shock wave, the transition 
of the laminar slot boundary layer plays an important role. At the more downstream 
shock impingement position at $x_{imp}/S = 60$ within the boundary-layer region (case 
III), the streamwise slope of the cooling effectiveness has increased compared to 
the zero pressure gradient configuration, indicating increased heat and momentum 
transfer due to the shock-wave impingement. At case IV, where the flow-deflection 
angle is increased by 3° compared to case III, the streamwise slope of the cooling 
effectiveness decline is even steeper. 
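
The cooling effectiveness and the percentage decreases quoted in this report can be evaluated with a few lines of code. The sketch below assumes the usual adiabatic film-cooling form, in which the wall temperature is referenced to the recovery temperatures of the freestream and of the cooling flow; the function names and the numerical values in the example are illustrative and not data from the simulations.

```python
def cooling_effectiveness(t_aw, t_r_inf, t_r_i):
    """eta = (T_aw - T_r_inf) / (T_r_i - T_r_inf).

    t_aw    : adiabatic wall temperature
    t_r_inf : recovery temperature of the freestream
    t_r_i   : recovery temperature of the cooling flow
    """
    difference = t_r_i - t_r_inf
    if difference == 0.0:
        raise ValueError("recovery temperatures must differ")
    return (t_aw - t_r_inf) / difference


def relative_decrease(eta_reference, eta):
    """Fractional decrease of eta with respect to a reference configuration."""
    return (eta_reference - eta) / eta_reference


if __name__ == "__main__":
    # Illustrative temperatures in kelvin (not taken from the simulations).
    eta_wall_at_coolant = cooling_effectiveness(200.0, 300.0, 200.0)
    eta_colder_wall = cooling_effectiveness(190.0, 300.0, 200.0)
    print(eta_wall_at_coolant, eta_colder_wall)
    print(relative_decrease(0.9, 0.6))
```

With this form, a wall that is colder than the recovery temperature of the cooling flow yields values above unity, which is consistent with the values slightly above one reported near the slot.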

The cooling effectiveness values which are slightly above unity at x/S = 10 
are due to the expansion fan and shock wave emanating from the lower tip of the 
lip which impinge upon the laminar slot boundary layer. This is evidenced by the 
temperature dip at x/S = 2 in the wall-temperature distribution in Fig. 7b. 

Besides the averaged wall temperatures, the frequent changes of the wall 
temperature have to be considered in the design process of the wall material of a 
Scramjet engine, as they might lead to thermal fatigue or local hot spots in the 
engine. Therefore, the cooling effectiveness fluctuations 



are evaluated for the shock-cooling film interactions in Fig. 8a. It is evident that 
the highest cooling effectiveness fluctuations occur at x/S = 17.82 at case II, 
which is downstream of the averaged reattachment point of the separation bubble. 
At the more downstream shock-impingement position at cases III and IV, the level of 
cooling effectiveness fluctuations is 16 % lower than at case II and hardly affected 
by the shock strength, since there is barely any difference between cases III and IV. 
To investigate the cause for the high cooling effectiveness fluctuation level at the 
shock impingement upon the potential-core region, conditional averages of the flow 
field are considered. In Fig. 8b the dashed line corresponds to the averaging threshold, 
i.e., all snapshots of the flow field with a local cooling effectiveness $\eta > 1.1$ are 
considered. The averaged total temperature contours shown in Fig. 9 are based on 
51 snapshots of the flow field. Figure 9c shows the conditionally averaged total 
temperature contours collected at the time when the condition $\eta > 1.1$ holds. It is 
visible that a region of cold fluid exists at the position of high cooling effectiveness 
fluctuations. The origin of the cold fluid is analyzed by considering the conditionally 
averaged flow field at $\Delta t/(S/u_\infty) = -2.5$ shown in Fig. 9b, i.e., before the time when 
$\eta > 1.1$ is satisfied. It is evident that the region of cold fluid is now located 
upstream of the point of maximum cooling effectiveness fluctuations. Considering 
the conditionally averaged flow field at $\Delta t/(S/u_\infty) = -5.5$ in Fig. 9a, it is visible 
that the region of cold fluid is located further upstream than at 
$\Delta t/(S/u_\infty) = -2.5$ and is part of the separation bubble at this instant. Note that the 
separation bubble has shrunk between $\Delta t/(S/u_\infty) = -5.5$ 
and $\Delta t/(S/u_\infty) = -2.5$, showing that the region of cold fluid has been shed off the 
separation bubble. Hence it can be concluded that the shedding of cold fluid off the 
separation bubble is responsible for very high cooling effectiveness values. 

Fig. 8 Cooling-effectiveness fluctuation distributions vs. streamwise distance and instantaneous cooling effectiveness vs. time. (a) Cooling effectiveness fluctuations. (b) Instantaneous cooling effectiveness at x/S = 17.82, z/S = 1.1 for case II; dashed line corresponds to a cooling effectiveness $\eta = 1.1$ 
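
The conditional averaging used in this section, including the look-back by a fixed time offset, can be sketched as follows. This is an illustration of the technique under simplifying assumptions, not the authors' post-processing: snapshots are stored as flat lists of field values at equidistant time levels, the offset is expressed as an integer number of snapshots, and all names are hypothetical.

```python
def conditional_average(snapshots, signal, threshold, lag=0):
    """Average snapshots[k + lag] over all k with signal[k] > threshold.

    snapshots : list of equally long lists (one field per time level)
    signal    : trigger signal, e.g. the local cooling effectiveness
    threshold : averaging threshold, e.g. 1.1
    lag       : snapshot offset; negative values look back in time
    """
    selected = [snapshots[k + lag]
                for k, value in enumerate(signal)
                if value > threshold and 0 <= k + lag < len(snapshots)]
    if not selected:
        raise ValueError("no snapshot satisfies the condition")
    count = len(selected)
    # Average every field entry over the selected snapshots.
    return [sum(values) / count for values in zip(*selected)]


if __name__ == "__main__":
    fields = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # synthetic
    eta = [1.0, 1.2, 1.0, 1.3]                                   # synthetic
    print(conditional_average(fields, eta, 1.1))
    print(conditional_average(fields, eta, 1.1, lag=-1))
```

Evaluating the same trigger events with several negative offsets gives a sequence of averaged fields from which the upstream origin of an event can be traced.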


5.2.4 Mean Flow Field 

To analyze the impact of the shock waves on the mixing in the flow field near the 
wall, the dimensionless fluid temperature 


$$\Theta = \frac{T_t - T_{t\infty}}{T_{ti} - T_{t\infty}} \qquad (3)$$ 


is evaluated. Its definition is similar to the cooling effectiveness $\eta$ except that total 
temperatures are used as reference temperatures. The quantity $\Theta$ reaches a value of 
1 in the cooling flow and 0 in the freestream. Figure 10a shows dimensionless fluid 
temperatures in the slot vicinity for the zero-pressure gradient configuration (case I) 
and the shock-cooling interaction within the potential-core region (case II). The 
spreading of the mixing layer at increasing downstream distance is evident in the 
profiles, e.g., at x/S = 10. It is the region where $\Theta$ varies from 1 to 0 at increasing 
wall-normal distances. Downstream of the shock-wave interaction at case II at 
x/S = 25, the dimensionless fluid temperature is lower than at case I, indicating 
increased mixing downstream of the shock compared to the zero-pressure gradient 
configuration case I. Further downstream, at the stronger shock-cooling interaction 
in the boundary-layer region at case IV, the dimensionless fluid temperature rises 
first at x/S = 50 due to the displacement of the separation bubble compared to the 
case I profile. Further downstream, the dimensionless fluid temperature at case IV 
is quickly reduced throughout the entire boundary-layer profile and at x/S = 80 
it almost matches the case II profile. At this downstream position, the cooling 
effectiveness of these two cases is also alike (Fig. 7a). 

Fig. 9 Conditionally averaged total temperature contours in the x,y-plane at z/S = 1.1 for case II; the arrow indicates the point of maximum cooling-effectiveness fluctuation and the circle marks regions of low energetic fluid. (a) $\Delta t/(S/u_\infty) = -5.5$. (b) $\Delta t/(S/u_\infty) = -2.5$. (c) $\Delta t/(S/u_\infty) = 0.0$ 


5.2.5 Turbulence Statistics 

Shock waves impinging upon turbulent boundary layers are known [28] to lead to 
increased turbulence levels downstream of the impingement point. These increased 
turbulence levels lead to increased mixing and heat and momentum transport between the 
cooling flow and the freestream. 

Fig. 10 Dimensionless fluid temperature profiles $\Theta$ at several downstream positions; grid spacing is $\Theta = 1$. (a) x/S = 0, 10, 15, 25. (b) x/S = 40, 60, 70, 80 

This is evidenced by the Reynolds shear stress profiles in Fig. 11. At case I 
the upper boundary of the potential-core region is located where $u''v''/u_\infty^2$ begins 
to deviate from zero at increasing wall-normal distances (Fig. 11a). At increasing 
streamwise distance from the slot, the upper boundary moves towards the wall. 
At shock impingement, the profile at x/S = 15 at case II shows a negative 
peak at y/S = -0.9 nearly at the end of the separation bubble. Further 
downstream of the shock-impingement position, absolute $u''v''/u_\infty^2$ levels have risen 
at x/S = 25, -1.16 < y/S < 1. Therefore, it is evident that the shock wave 
impinging upon the potential-core region has led to the transition of the laminar 
slot boundary layer and has reduced the length of the potential-core region. At 
shock impingement within the boundary-layer region negative peaks of $u''v''/u_\infty^2$ 
are detected at x/S = 60 close to the wall, which move off the wall further 
downstream. The higher $u''v''/u_\infty^2$ values of cases III and IV compared to 
cases I and II in the boundary-layer region at x/S = 80 could explain the steeper 
streamwise slope in cooling effectiveness observed in Fig. 7a. 


Fig. 11 Reynolds shear stress component at several downstream positions; grid spacing is $u''v''/u_\infty^2 = 0.0035$. (a) x/S = 0, 10, 15, 25. (b) x/S = 40, 60, 70, 80 


5.3 Computational Resources 

The simulations for this study were performed on the vector computer NEC SX-9 at 
the HLRS in Stuttgart. The grids are divided into blocks, each of which resides on a single CPU. 
Data is exchanged using the message passing interface (MPI). The computational 
details are given in Table 2, where “Case I” denotes the case I of the current slot 
cooling computation. It is evident that the flow solver is fully vectorized with a 
vector operation ratio higher than 99 %. Furthermore, a “Case Ilarge” is given, 
which is similar to “Case I” but with a larger streamwise domain extent. The grid of 
this case is 1.4-fold larger than that of case I and the computation reaches an overall 
performance of 285 GFLOPS. A different computation for the analysis of a cooling 
configuration using several rows of discrete holes is denoted “Case Multirow” in 
Table 2. This computation was run on two SX-9 nodes since it has a complex grid 
with small grid cells and large block surfaces. Therefore, the overall performance of 
“Case Multirow” is lower than that of “Case Ilarge”, at 62 % of the grid points of “Case 
Ilarge”. 

Table 2 Performance on NEC SX-9 

                              Case I        Case Ilarge   Case Multirow 
Number of CPUs                7             16            30 
Number of Nodes               1             1             2 
Grid points/CPU               2.44 · 10^6   1.51 · 10^6   0.5 · 10^6 
Total grid points             17.1 · 10^6   24.1 · 10^6   15 · 10^6 
Avg. User Time (s)            41,552        41,556        41,761 
Avg. Vector Time (s)          36,827        38,502        38,844 
Vector Operations Ratio (%)   99.6          99.7          99.54 
Avg. Vector Length            240.585       251.439       235.25 
Memory/CPU (MB)               2,422         1,998         1,426 
Total Memory (GB)             16.95         31.98         42.8 
Avg. MFLOPS/CPU               13,997        17,873        8,587 
Max. MFLOPS/CPU               19,252        19,590        15,854 
Total GFLOPS                  97.8          285.881       257.5 
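
The aggregate figures quoted for Table 2 can be cross-checked from the per-CPU entries. The short script below is only such a consistency check using the tabulated values; the dictionary layout and the function name are illustrative.

```python
TABLE_2 = {
    "Case I": {"cpus": 7, "avg_mflops_per_cpu": 13997,
               "total_gflops": 97.8, "grid_points": 17.1e6},
    "Case Ilarge": {"cpus": 16, "avg_mflops_per_cpu": 17873,
                    "total_gflops": 285.881, "grid_points": 24.1e6},
    "Case Multirow": {"cpus": 30, "avg_mflops_per_cpu": 8587,
                      "total_gflops": 257.5, "grid_points": 15.0e6},
}


def aggregate_gflops(cpus, avg_mflops_per_cpu):
    """Total performance in GFLOPS from the per-CPU average in MFLOPS."""
    return cpus * avg_mflops_per_cpu / 1000.0


if __name__ == "__main__":
    for name, row in TABLE_2.items():
        estimate = aggregate_gflops(row["cpus"], row["avg_mflops_per_cpu"])
        print(name, round(estimate, 1), "vs. tabulated", row["total_gflops"])
    large = TABLE_2["Case Ilarge"]["grid_points"]
    print("grid ratio Ilarge / I:", large / TABLE_2["Case I"]["grid_points"])
    print("grid ratio Multirow / Ilarge:",
          TABLE_2["Case Multirow"]["grid_points"] / large)
```

The products of CPU count and average per-CPU rate agree with the tabulated totals to within a fraction of a GFLOPS, and the grid-size ratios reproduce the factors 1.4 and 62 % stated in the text.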


6 Conclusion 

Large-eddy simulations of shock-cooling-film interactions have been performed. 
The shock waves impinge upon the potential-core (case II) and boundary-layer 
regions (cases III and IV). At the shock-wave impingement position within the 
potential-core region, the transition of the laminar slot boundary layer occurred 
downstream of the separation bubble. The increased turbulence levels in the shear 
and mixing layers located between the cooling flow and the freestream led to a 
33 % decrease of cooling effectiveness compared to a zero-pressure gradient 
configuration (case I). Large cooling effectiveness fluctuations were detected 
downstream of the separation bubble at shock impingement, since the bubble sheds large 
patches of cold fluid. At the shock impingement upon the boundary-layer region at 
the same shock strength as at the shock impingement upon the potential-core region, 
a less drastic decrease of cooling effectiveness (17 %) was observed. However, the 
streamwise cooling effectiveness gradient showed a higher magnitude (case III) 
compared to the zero-pressure gradient configuration (case I). At increasing shock 
strength at the further downstream impingement position (case IV), the cooling 
effectiveness decreased even more rapidly, i.e., the streamwise cooling effectiveness 
gradient showed an even higher magnitude. 


Large-Eddy Simulations of Stratified and 
Non-stratified T-junction Mixing Flows 


David Klören and Eckart Laurien 


Abstract Thermal mixing of coolants with large temperature differences in cooling 
circuits of power plants may lead to high cycle thermal fatigue in the pipe material. 
In these mixing regions cracks are often observed in the vicinity of weld seams. 
In this study the influence of a weld seam in straight pipe flows, isothermal 
T-junction mixing flows and stratified T-junction mixing flows is investigated with 
Large-Eddy Simulations (LES). The results are compared to experimental data. 
Furthermore, T-junction flows with large temperature differences, which are based 
on an experimental setup at the Institute of Nuclear Technology and Energy Systems 
(IKE), are numerically investigated and characterized. 


1 Introduction 

The safety analysis of a nuclear power plant involves many components which 
are subjected to thermal loading. On the one hand, system transients and steady 
state stratification effects lead to Low Cycle Fatigue (LCF). On the other hand, 
the temperature fluctuations caused by flow instabilities of non-isothermal mixing 
flows may lead to the premature loss of the mechanical integrity of the surrounding 
material. Such High Cycle Fatigue (HCF) due to thermal striping has been observed, 
for instance, in a mixing tee pipe configuration of a residual heat removal system [2]. 
In contrast to LCF, Fluid-Structure-Interaction (FSI) associated with high cycle 
fatigue is not sufficiently understood. Furthermore, the plant instrumentation is not 
designed to monitor thermal transients or the flow conditions related to HCF. The 
assessment of safety and reliability of these components is therefore extremely 
conservative. 

A strategy based on numerical simulations is adopted in order to provide the 
characteristic thermal loads which act on the pipe walls. Hence, the models applied 
in Computational Fluid Dynamics (CFD) have to be validated for the phenomena 
such as thermal mixing, buoyancy effects and FSI. 

The studies performed on the T-junction problem (e.g. Westin et al. [15]) were 
focused mainly on the mixing zones and the flow downstream of a T-junction. 
Most experiments are designed for low inlet temperature differences in order to 
allow optical access for non-intrusive measurements of the velocity and temperature 
field. For this reason, a new hot T-junction setup that allows optical access and 
temperature measurements in the near-wall region of the fluid is currently tested 
at the IKE [8]. Thermally coupled Large-Eddy Simulations of this setup were 
performed [7] to investigate the fluid-structure interaction. 

Usually, straight smooth pipes upstream and downstream of the T-junction 
were considered. However, cracks were typically observed in the vicinity of weld 
seams [1]. In the framework of the CFD validation of mixing flows in a T-junction 
the influence of the weld seam has not been addressed previously. 

Firstly, this numerical study investigates the influence of a weld seam on the flow 
field of fully developed turbulent pipe flow, isothermal T-junction mixing flow and 
stratified T-junction mixing flow. The results are compared against measurements. 
The experimental setup is designed for the development and application of optical 
measurement methods under cold conditions (20 °C) which will be employed in the 
hot T-junction experiment with temperatures up to 280 °C and a pressure of 75 bar. 

Secondly, a first experimental study of the thermal conditions in the hot 
T-junction flow configuration indicates a stable stratified flow with upstream flow of 
cold water in the hot line and vice versa. In order to characterize the flow conditions, 
further numerical simulations are performed with temperature differences up to 
150 °C. 


2 Setup 

The geometric proportions are based on a T-junction experiment designed by the 
Institute of Nuclear Technology and Energy Systems (IKE) and the Material Testing 
Institute (MPA) at the University of Stuttgart for the research in thermal fatigue and 
CFD model development [8]. The setup consists of a larger pipe with an inner 
diameter (ID) of $d_1 = 71.8$ mm. The sharp edged, 90° T-junction is connected to 
the main pipe and a branch pipe with an ID of $d_2 = 38.9$ mm. Three different basic 
configurations are considered: a single pipe flow (A), a T-junction mixing flow with 
constant fluid properties (B) and a T-junction mixing flow affected by buoyancy 
forces and stratification (C). Although each configuration was investigated with and 
without weld seam model numerically as well as experimentally, only the cases 
including the weld seam model are considered for this discussion. 
Fig. 1 Geometry of straight pipe (left) and T-junction (right) with weld seam. The numerical 
T-junction model is shown as a cut (in the symmetry plane) for better visualization. The position 
of the weld seam model is the dent in the surface downstream of the T-junction 


Table 1 Simulation test cases for the weld seam investigations. Simulation 
without weld seam (Case 7) described by Kloeren [7] 

                                   Volumetric flow rate
Case no.   Setup   Weld seam   $Q_1$ [l/s]   $Q_2$ [l/s]
1          A       Yes         0.8           -
2          A       Yes         0.4           -
3          A       Yes         0.4           0.1
4          B       Yes         0.5           0.16
5          B       Yes         0.7           0.51
6          C       Yes         0.4           0.1
7          C       No          0.4           0.1

For the straight pipe case A (Fig. 1, left) the weld seam model is positioned so 
that the influence on a fully developed turbulent pipe flow is investigated. For the 
T-junction cases (B, C), it is placed $4d_1$ downstream of the origin of the T-junction 
coordinate system, which is defined as the intersection point of both the main 
and branch pipe centerlines (Fig. 1, right). The working fluid is water at ambient 
temperature of 20 °C and ambient pressure level for both the pipe flow and the 
isothermal T-junction setup (A, B). 

In case of the buoyancy-affected T-junction flow (C), a temperature difference of 
$\Delta T = 100$ °C at a pressure $p = 75$ bar ($T_1 = 120$ °C, $T_2 = 20$ °C) is chosen for the 
numerical investigations. This is similar to the T-junction simulation by Kloeren [7] 
which is based on the hot experiment at the IKE. The experimental studies at the 
cold test-rig employ sugar water in order to realize a comparable density difference. 
Various volumetric flow rates $Q_1$ and $Q_2$ as well as flow rate ratios $Q_1/Q_2$ 
are employed. The numerical setups considered for the weld-seam discussion are 
summarized in Table 1. 
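
As a rough plausibility check of the flow rates in Table 1, the bulk velocity and the Reynolds number can be estimated from the volumetric flow rate and the pipe diameter. The sketch below is illustrative only: the function names are hypothetical, and the kinematic viscosity of about 1.0e-6 m^2/s for water at 20 °C is an assumed value that is not taken from the study.

```python
import math

def bulk_velocity(q_litre_per_s, diameter_m):
    """Bulk velocity in m/s for a volumetric flow rate given in l/s."""
    area = math.pi * diameter_m ** 2 / 4.0
    return q_litre_per_s * 1.0e-3 / area

def reynolds_number(q_litre_per_s, diameter_m, nu=1.0e-6):
    """Pipe Reynolds number; nu is an ASSUMED kinematic viscosity in m^2/s."""
    return bulk_velocity(q_litre_per_s, diameter_m) * diameter_m / nu

D1, D2 = 0.0718, 0.0389  # inner diameters of main and branch pipe in m

re_main = reynolds_number(0.8, D1)     # largest main-pipe flow rate in Table 1
re_branch = reynolds_number(0.51, D2)  # largest branch-pipe flow rate in Table 1
print(f"Re(main) = {re_main:.0f}, Re(branch) = {re_branch:.0f}")
```

Under these assumptions both values stay well below the limit of $4 \times 10^4$ quoted in the physical model section.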

An additional case with a temperature difference of 150 °C (Case-No. 8) and 
without a weld seam is considered for the characterization of the hot T-junction 
flow. The same mass flow rates as case no. 7 (Table 1) are applied. The domain for 
the inlet flows, however, is extended in upstream direction. 
3 Physical Model 


Both the turbulent thermal mixing in a T-junction and the flow field affected by a 
weld seam involve highly anisotropic flow conditions. Furthermore, the unsteady 
flow field needs to be resolved in order to provide characteristic temperature 
fluctuations for fatigue analysis. Consequently, the main effort of CFD validation for the 
T-junction problem has been directed to LES methods or related hybrid turbulence 
models, such as Detached-Eddy Simulation or Scale-Adaptive Simulation [5]. 

The LES approach resolves the energy-containing large-scale turbulence without 
turbulence modeling. Hence, the LES is always a three-dimensional and 
transient computation on a high quality mesh. The separation of turbulent scales is 
mathematically achieved by a filter operation [3], by which an additional term 
for the small-scale turbulence is introduced. These stress components $\tau_{ij,SGS}$ (1) 
are modeled by the so-called Subgrid-Scale (SGS) models. It is assumed that the 
turbulence becomes increasingly isotropic towards the dissipation scales and thus 
simple algebraic SGS models based on the mixing-length hypothesis are widely 
utilized. 

For this work, the dynamic Smagorinsky SGS model by Germano [4] and 
Lilly [10] is chosen. With Reynolds numbers (Re) below $4 \times 10^4$ for all test cases the 
LES of wall bounded internal flows can be performed in reasonable computational 
time. The SGS shear stress $\tau_{ij,SGS}$ is modeled as follows: 

$$\tau_{ij,SGS} = -2\mu_t \left(S_{ij} - \frac{1}{3} S_{kk}\delta_{ij}\right) \qquad (1)$$


with the filtered strain rate tensor of the resolved scales 

$$S_{ij} = \frac{1}{2}\left(\frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i}\right) \qquad (2)$$
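
Equations (1) and (2) can be evaluated for a single cell as in the following minimal sketch. The function names, the sample velocity gradient and the value of the turbulent viscosity are hypothetical and only serve to illustrate the tensor operations.

```python
def strain_rate(grad_u):
    """S_ij = 0.5 * (du_i/dx_j + du_j/dx_i) for a 3x3 velocity gradient, Eq. (2)."""
    return [[0.5 * (grad_u[i][j] + grad_u[j][i]) for j in range(3)]
            for i in range(3)]

def sgs_stress(grad_u, mu_t):
    """tau_ij = -2 mu_t (S_ij - S_kk delta_ij / 3), Eq. (1)."""
    s = strain_rate(grad_u)
    trace = s[0][0] + s[1][1] + s[2][2]
    return [[-2.0 * mu_t * (s[i][j] - (trace / 3.0 if i == j else 0.0))
             for j in range(3)] for i in range(3)]

# Simple shear du/dy = 10 1/s with an ASSUMED mu_t = 1e-3 Pa s
grad_u = [[0.0, 10.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
tau = sgs_stress(grad_u, 1.0e-3)
print(tau[0][1])
```

The resulting tensor is symmetric and trace-free by construction, since the isotropic part is subtracted in Eq. (1).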


In the Smagorinsky-Lilly model the subgrid-scale turbulent viscosity $\mu_t$ (3) depends 
on the strain rate tensor (2), the cell volume $V$ and the model constant $C_s$. The linear 
function of the length scale containing the Karman constant $\kappa = 0.41$ and the wall 
distance $d$ ensures the correct wall behavior in the near-wall region. The filter length 
$\Delta$ is defined as $\Delta = V^{1/3}$. 

$$\mu_t = \rho \left(\min(\kappa d, C_s \Delta)\right)^2 \sqrt{2 S_{ij} S_{ij}} \qquad (3)$$

The local model coefficient $C_s$ is dynamically calculated, assuming scale similarity. 
$C_s$ is derived from the difference of the resolved solution for both the grid filter 
width $\Delta$ and the test filter width $2\Delta$. For numerical stability reasons the $C_s$ value is 
limited to the range from 0 to 0.23. 
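
The eddy-viscosity expression (3), including the limiter on $C_s$, can be sketched as follows. This is not the solver's implementation: the dynamic procedure that produces $C_s$ is not reproduced, so the coefficient is simply passed in, and all sample values are assumptions.

```python
import math

KAPPA = 0.41                 # Karman constant as stated in the text
CS_MIN, CS_MAX = 0.0, 0.23   # clipping range as stated in the text

def eddy_viscosity(rho, strain, wall_distance, cell_volume, c_s):
    """mu_t = rho * min(kappa*d, C_s*Delta)^2 * sqrt(2 S_ij S_ij), Eq. (3)."""
    c_s = min(max(c_s, CS_MIN), CS_MAX)        # stability limiter on C_s
    delta = cell_volume ** (1.0 / 3.0)         # filter length Delta = V^(1/3)
    length = min(KAPPA * wall_distance, c_s * delta)
    s_norm = math.sqrt(2.0 * sum(strain[i][j] ** 2
                                 for i in range(3) for j in range(3)))
    return rho * length ** 2 * s_norm

# ASSUMED sample state: water, simple shear with S_xy = S_yx = 5 1/s
strain = [[0.0, 5.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mu_core = eddy_viscosity(1000.0, strain, 1.0e-2, 1.0e-9, 0.5)   # C_s clipped to 0.23
mu_wall = eddy_viscosity(1000.0, strain, 1.0e-4, 1.0e-9, 0.5)   # kappa*d is limiting
print(mu_core, mu_wall)
```

Close to the wall the length scale $\kappa d$ becomes smaller than $C_s \Delta$, which reduces the eddy viscosity there.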

The subgrid-scale turbulent heat flux is modelled as follows: 


$$q_{\theta,i} = -\frac{\mu_t c_p}{Pr_{t,SGS}} \frac{\partial T}{\partial x_i} \qquad (4)$$
The subgrid-scale turbulent Prandtl number $Pr_{t,SGS}$ is dynamically calculated 
according to Lilly [10]. Scale similarity is assumed again and $Pr_{t,SGS}$ can be 
explicitly determined based on the resolved solution of the grid filter and the 
test filter. 

In case the viscous sub-layer is not sufficiently resolved a wall function with 
a smooth interpolation between the sub-layer and the logarithmic wall region is 
applied. The blending function used for the buffer layer is similar to the formulation 
utilized by Kader [6]. 
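
The idea of a smooth blend between the viscous sub-layer and the logarithmic region can be illustrated with a Kader-type blending function. The exact formulation used in the simulations is not given in the text, so the blending constants `a` and `b` and the log-law constant `E` below are assumptions chosen for illustration only.

```python
import math

KAPPA = 0.41   # Karman constant as stated in the text
E = 9.793      # ASSUMED log-law constant for smooth walls

def u_plus(y_plus, a=0.01, b=5.0):
    """Blend of linear sub-layer and log law; a and b are ASSUMED constants."""
    if y_plus <= 0.0:
        return 0.0
    gamma = -a * y_plus ** 4 / (1.0 + b * y_plus)
    if gamma == 0.0:               # y_plus so small that the blend is purely linear
        return y_plus
    u_lam = y_plus                               # viscous sub-layer: u+ = y+
    u_turb = math.log(E * y_plus) / KAPPA        # logarithmic law of the wall
    return math.exp(gamma) * u_lam + math.exp(1.0 / gamma) * u_turb

for yp in (1.0, 10.0, 100.0):
    print(yp, u_plus(yp))
```

For small $y^+$ the function follows the linear law, for large $y^+$ it approaches the logarithmic law, and the buffer layer is bridged without a kink.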


4 Numerical Methods 

In the commercial finite volume solver ANSYS FLUENT the SIMPLE algorithm 
is chosen for the velocity-pressure coupling. The second order central differencing 
scheme is applied for the convection terms for both the momentum and the energy 
equation. For the pressure interpolation the PRESTO scheme [12] is applied. 


4.1 Boundary Conditions 

The weld-seam experiments are conducted in a cold state in pipes made of PVC. 
For this reason, the simulations apply smooth adiabatic wall boundary conditions in 
all cases. The velocity inlet conditions are taken from RANS simulations of fully 
developed turbulent pipe flow. The outlet is set to pressure outlet conditions. 

Although LES usually requires unsteady turbulent fluctuations at the inlet, 
sometimes the inlet turbulence might be insignificant compared to the turbulence 
produced by the flow instabilities, e.g. in the mixing zone of a T-junction. In 
the straight pipe flow cases, no such possibly dominating flow instabilities occur 
upstream of the weld seam model. For this reason, artificial inflow turbulence is 
modeled by the vortex method [11] for all the considered cases. RANS of fully 
developed pipe flows are performed with the Reynolds Stress Model (RSM). The 
random planar inlet fluctuations are emulated based on the profiles of the mean 
velocity, turbulent kinetic energy, turbulent dissipation rate and the three Reynolds 
normal stresses. Starting from these inlet fluctuations, the flow has $3d_1$ to recover the 
proper turbulent field. 


4.2 Time Discretization 

Time steps from 1 to 2 ms are chosen with a Courant number less than unity in 
most of the domain. Courant numbers up to 3 can occur locally. However, due to 
the second order backward Euler time integration scheme this does not pose a threat 
to the numerical stability. Before recording the data for turbulent statistics the flow 
was given time to develop so that the flow has passed through the full geometry at 
least once, based on the bulk velocity. At least 8-10 s are calculated for the turbulent 
statistics. 

Table 2 Dimensionless size of the wall-adjacent cell volume according to best 
practice guidelines (BPG) listed in [3] 

                 $\Delta x^+$   $\Delta y^+$   $\Delta z^+$
Sagaut [14]      10             2              5
Fröhlich [3]     50             2              15
Piomelli [13]    100            2              40
This study       < 50           < 3            < 15

Fig. 2 Numerical mesh close to the weld seam modeled as a half-circle 
(width = 8.3 mm, height = 4 mm) 
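
The time-step criteria of Section 4.2 can be checked with two short helper functions, one for the local Courant number and one for the flow-through time based on the bulk velocity. The cell size, local velocity and bulk velocity in the example are assumed values, not data from the simulations.

```python
def courant_number(velocity, dt, dx):
    """Local Courant number C = |u| * dt / dx."""
    return abs(velocity) * dt / dx

def flow_through_time(length, bulk_velocity):
    """Time a fluid parcel needs to pass the domain at bulk velocity."""
    return length / bulk_velocity

D1 = 0.0718                      # main pipe inner diameter in m
domain_length = 13.0 * D1        # 3 d1 upstream plus 10 d1 downstream of the junction

# ASSUMED sample values: 0.2 m/s local velocity, 1 ms time step, 0.5 mm cell
c_local = courant_number(0.2, 1.0e-3, 0.5e-3)
t_flow = flow_through_time(domain_length, 0.1)   # ASSUMED bulk velocity 0.1 m/s
print(c_local, t_flow)
```

With these assumed numbers one flow-through time is of the same order as the 8-10 s sampling window mentioned above.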


4.3 Numerical Grid 

The geometry and the local coordinate system are shown in Fig. 1. The 
computational domain extends $10d_1$ downstream of the inlet boundary for the single 
pipe case (A). The domain for the T-junction cases (B, C) extends from $3d_1$ 
upstream to $10d_1$ downstream of the junction. For all setups the weld seam geometry 
is modeled with a half-circle profile (width = 8.3 mm, height = 4 mm) and placed 
$4d_1$ downstream of either the inlet boundary (A) or the intersection (B, C). For this 
problem the numerical grid contains ca. five million cells. 

The numerical mesh is designed according to the best practice guidelines by 
Fröhlich [3] for a wall-resolved LES. The degree of restrictiveness for the wall 
adjacent control volume lies between the guidelines suggested by Sagaut [14] and 
Piomelli [13] as summarized by Fröhlich and shown in Table 2. The dimensionless 
length (stream-wise), height (wall-normal) and width (transversal) of the first 
cell from the wall surface are denoted as $\Delta x^+$, $\Delta y^+$ and $\Delta z^+$ respectively. 
The superscript “+” indicates a non-dimensional length scaled by the friction 
velocity $u_\tau$ and the kinematic viscosity $\nu$. Only occasionally does $\Delta y^+$ slightly exceed the 
recommended value, so that the simulation can still be considered a wall-resolved LES. The mesh 
in the weld seam region is shown in Fig. 2. 
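
The conversion of a physical cell size into wall units follows directly from the definition above. In the sketch below the friction velocity is estimated from the Blasius friction law for smooth pipes, which is an assumption made here for illustration; the study itself obtains the wall quantities from the simulations. The sample bulk velocity and Reynolds number roughly correspond to 0.8 l/s in the main pipe with an assumed viscosity of 1.0e-6 m^2/s.

```python
import math

def friction_velocity(bulk_velocity, reynolds):
    """u_tau from the Blasius friction law f = 0.316 Re^-0.25 (ASSUMED closure)."""
    f = 0.316 * reynolds ** -0.25          # Darcy friction factor
    return bulk_velocity * math.sqrt(f / 8.0)

def wall_units(length, u_tau, nu):
    """Non-dimensional length: length * u_tau / nu."""
    return length * u_tau / nu

U_BULK = 0.1976   # m/s, ASSUMED (about 0.8 l/s in the main pipe)
RE = 14187.0      # ASSUMED Reynolds number for that flow rate
NU = 1.0e-6       # m^2/s, ASSUMED kinematic viscosity of water at 20 degC

u_tau = friction_velocity(U_BULK, RE)
dy_plus = wall_units(1.0e-4, u_tau, NU)   # wall-normal cell height of 0.1 mm
print(u_tau, dy_plus)
```

Under these assumptions a first cell height of about 0.1 mm corresponds to $\Delta y^+$ slightly above one, which is within the range listed for this study in Table 2.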
Fig. 3 Pipe Flow: Mean value of normalized $u_{mean}$ at $0.3d_1$ downstream of the 
weld seam (axis label: position z/D) 

Fig. 4 Pipe Flow: RMS value of normalized $u_{mean}$ at $0.3d_1$ downstream of the 
weld seam 



5 Results 


The arithmetic mean values and root mean square (RMS) values of the fluctuations 
are compared. For an entity A with a set of N data points they are defined as: 


$$A_{mean} = \frac{\sum A_i}{N}, \qquad A_{RMS} = \left(\frac{\sum \left(A_i - A_{mean}\right)^2}{N}\right)^{1/2} \qquad (5)$$
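
The two statistics of Eq. (5) can be written as a short sketch. The normalization by N (rather than N - 1) is an assumption made here, and the sample values are arbitrary.

```python
import math

def mean_value(samples):
    """Arithmetic mean of a sequence of samples."""
    return sum(samples) / len(samples)

def rms_value(samples):
    """RMS of the fluctuations about the mean (normalized by N, an assumption)."""
    m = mean_value(samples)
    return math.sqrt(sum((a - m) ** 2 for a in samples) / len(samples))

signal = [1.0, 2.0, 3.0, 4.0]   # arbitrary sample data
print(mean_value(signal), rms_value(signal))
```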


5.1 Pipe Flow 

The simulation results are compared to PIV data in Figs. 3 and 4. Initially, the inlet 
mass flow rates for both the simulations and the experiments were set identically, 
based on the hydrostatic pressure in the water supply tank. Flow rate measurements 
based on PIV, however, proved to be more reliable in this case and the experimental 
mass flow rates are adjusted accordingly. 
Fig. 5 T-junction: Mean value of normalized $u_{mean}$ at $0.3d_1$ downstream of the 
weld seam for different flow rate ratios $Q_1/Q_2$ (axis label: position z/D) 


The vertical profiles of the mean (Fig. 3) and RMS values (Fig. 4) of the velocity 
are compared at the axial position x/d\ = 0.3 downstream of the weld seam. The 
velocities are scaled by the bulk velocity $u_{bulk}$. Both LES and the experiments show 
similarity of the profiles for varying volumetric flow rates Q. The recirculation 
bubble is indicated by the negative mean velocities close to the wall. 

The RMS-values have a minimum in the middle of the pipe and increase towards 
the walls up to 35 % of $u_{bulk}$. The reference bulk velocity is calculated by the 
local one-dimensional vertical velocity profile in order to be consistent with the 
normalization methods of the experimental data. 

While the core flow profiles overlap very well, the recirculation of flow is 
overestimated in the simulation. Furthermore, the near-wall magnitude of the 
velocity RMS for flows with similar flow rate is larger for the experimental data. 
This corresponds to the slightly steeper velocity gradients in the mean velocity 
profile. 


5.2 Non-stratified T-junction Flow 

All cases with a weld seam installed display a flattened core mean velocity 
profile (Fig. 5). The flat, slightly M-shaped core profile is captured well by the 
simulation. The M-shape resembles a wake flow since the intersecting branch pipe 
inflow acts as an obstacle for the main water stream. For low flow rates the profile 
tends to lean towards the bottom of the pipe. The blue and the red simulations 
display a strong similarity of their mean velocity profiles despite the difference 
between the absolute values of the inlet flow rates as well as the flow rate ratios 
($Q_1/Q_2 = 4/1$ and $Q_1/Q_2 = 2/1$). 

For the LES results small recirculation bubbles can be identified on both sides 
of the wall but not for the experimental results. The peaks of the RMS values 
(Fig. 6) of the lower flow rates (Case 3, red line) are underestimated compared to 
the experiment (black dots). The underestimation can be up to 25 %. For larger flow 
rates (green line, blue “x”) the agreement is very good. The flat profile of the RMS 
values of the core flow, however, is well reproduced, as is the location of the RMS 
peaks. 

Fig. 6 T-junction: RMS value of normalized $u_{mean}$ at $0.3d_1$ downstream of the 
weld seam for different flow rate ratios $Q_1/Q_2$ (axis label: position z/D) 

Fig. 7 Stratified T-junction flow with weld seam (experiment). Mean value (left) and RMS value 
(right) of the near-wall equivalent temperature field are shown. The blue bar indicates the weld seam. 
The main flow direction is from left to right 


5.3 Stratified T-junction Flow 

For the stratified case first measurements of the new so-called 
Near-Wall-LED-Induced-Fluorescence (NW-LED-IF) method [9] are shown. This technique is based 
on the use of dissolved fluorescent dyes with temperature-dependent properties, 
which are excited by a green LED light source. It allows the measurement of 
unsteady temperature or density fields within 1 mm distance from the wall, as shown 
in Fig. 7. Sugar was dissolved in the water of the branch line so that the density 
fractions of sugar-water mixing and hot water mixing ($T_1 = 120$ °C, $T_2 = 20$ °C) are 
similar. 
Fig. 8 Stratified T-junction cases with weld seam (LES) - Mean value (left) and RMS value (right) 
of the near-wall temperature field are shown. The white vertical bar in the middle of each field 
indicates the weld seam. The whole pipe is shown in the vertical direction. The main flow direction 
is from left to right 


So far, only a qualitative comparison can be considered because of the difference 
of the absolute viscosity and viscosity ratio as well as the different volume flow 
rates for the weld seam case. However, a wavy character of the stratified flow 
field, as described by Kloeren [7] and indicated in Fig. 8, is also observed for the 
experimental case. 

Furthermore, a similar flow situation is found close to the weld seam (Figs. 3-6), 
in which the stratification is shifted downwards (upstream) and upwards 
(downstream) due to the presence of the weld seam. This leads to increased mixing and 
fluctuations in the upstream region in the vicinity of the weld. Compared to the 
simulation, the experiment shows a rather stretched region of increased RMS values, 
parallel to the weld line. 

The weld-seam geometry in combination with the recirculation bubbles up- and 
downstream of the weld seam contracts the effective inner diameter and accelerates 
the core flow. Due to the wave-like character of the present stratification with the 
related circumferential velocity components the reduced inner diameter causes an 
increased rotation in order to maintain the angular momentum of the core flow. 

For the same reason a counter rotational component in the recirculation bubbles is 
induced to balance the global angular momentum. Figure 9 shows the instantaneous 
tangential velocity in the cross-section (ca. $0.1d_1$ downstream of the weld seam). 
The counter-current rotation between the core flow and near-wall recirculation area 
is indicated by the arrows. The tangential velocities can almost reach the bulk 
velocities. The stably stratified flow experiences a sudden shift by these counter 
current angular flows. The increased shear flow and the non-stable temperature 
gradients lead to increased near-wall thermal mixing. 
Fig. 9 Instantaneous tangential velocity field in the cross section shortly downstream of the 
weld seam 



Fig. 10 Mean temperature distribution for temperature differences of 100 °C (Case 7, top) and 
150 °C (Case 8, bottom). The arrows in the xz-plane (left) and the yz-plane (right) indicate the 
inflow of the hot and the cold flow, respectively. Red indicates the maximum temperature and blue 
is 20 °C 

5.4 Onset of Upstream Flow for Large 
Temperature Differences 

First experimental and numerical investigations of the hot T-junction setup indicate 
an upstream flow of cold fluid into the main pipe. Knowledge about the onset and 
development of the upstream flow is important in order to conduct measurements of 
inlet flow conditions. 

The results of the LES of Cases 7 and 8 are shown in Fig. 10. They show that a 
temperature difference of 150 °C induces an upstream flow in the main pipe of more 
than $3d_1$ (Fig. 10, left). An elongated recirculation area is formed which extends 
from the branch pipe to the tip of the cold flow upstream. A similar situation 
can be identified in the branch pipe. For 150 °C, hot fluid can be expected around 
$3d_2$ upstream and has to be considered for optical measurements in the branch 
pipe. Additionally, LES of T-junction flow with high temperature differences of 
260 °C have been performed. However, a domain with an inlet pipe of $25d_1$ 
upstream of the junction proved to be insufficient to ensure a negligible upstream effect 
on the flow field near the inlet boundary condition. Although simulation results cannot 
be considered valid under those conditions, the extent of the upstream flow does 
not contradict experimental observations. 


6 Computational Effort 

The scalability for the thermal T-junction mixing flow case with weld seam is shown 
in Fig. 11. Since only whole nodes can be allocated (with one to eight cores used per 
node), the scaling performance is analyzed against the number of nodes instead of the 
number of processors. The speed-up (Fig. 11, left) indicates that using half of the 
available cores scales better than using all eight cores, so that the average 
time per iteration (Fig. 11, right) is slightly smaller for 15 nodes with 60 instead of 
120 cores. However, the savings in average wall clock time per iteration between 40 
cores on 5 nodes and 60 cores on 15 nodes are minimal. For this test case the scaling 
performance for 8 cores per node behaves well up to 5-10 nodes. The meshes are 
partitioned automatically and the load is equally distributed on the available cores. 
The data files are written serially. To reduce the simulation time the residuals are 
printed out only once per time step. 
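
Speed-up and parallel efficiency relative to a reference run (here the 2-node solution, as in Fig. 11) can be computed as sketched below. The timings in the example are invented placeholders, since the measured values are only available graphically.

```python
def speed_up(t_ref, t_n):
    """Speed-up of a run with time t_n relative to the reference time t_ref."""
    return t_ref / t_n

def parallel_efficiency(t_ref, n_ref, t_n, n):
    """Speed-up divided by the ideal speed-up n / n_ref."""
    return speed_up(t_ref, t_n) * n_ref / n

# HYPOTHETICAL wall clock times per iteration in seconds
t_2_nodes, t_10_nodes = 10.0, 2.5
s = speed_up(t_2_nodes, t_10_nodes)
eff = parallel_efficiency(t_2_nodes, 2, t_10_nodes, 10)
print(s, eff)
```

An efficiency below one, as in this made-up example, indicates that communication or memory bandwidth limits the gain from additional nodes.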

For the T-junction simulations considered in the weld seam investigation usually 
40 cores were used for a single simulation (8 CPUs per node). For this mesh of ca. 
five million cells a 24 h run results in a simulation time advancement of ca. 4 s. On 
average, each computation requires ca. $1 \times 10^4$ CPU hours. The straight pipe flow 
grids consist of ca. three million cells and 2-5 nodes were generally used. 
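
The figures quoted above can be related to each other with simple arithmetic. The following sketch uses only the numbers from the text (40 cores, ca. 4 s of simulated time per 24 h, ca. 1 x 10^4 CPU hours); the function names are hypothetical.

```python
def cpu_hours(cores, wall_hours):
    """CPU hours consumed by a run on a given number of cores."""
    return cores * wall_hours

def simulated_seconds(total_cpu_hours, cores, sim_seconds_per_day):
    """Physical time reached with a CPU-hour budget at a given daily advancement."""
    wall_days = total_cpu_hours / cores / 24.0
    return wall_days * sim_seconds_per_day

per_day = cpu_hours(40, 24.0)                 # CPU hours per 24 h run
total = simulated_seconds(1.0e4, 40, 4.0)     # physical seconds per 1e4 CPU hours
print(per_day, total)
```

A budget of 1 x 10^4 CPU hours thus corresponds to roughly 40 s of physical time on 40 cores, which covers the flow development phase and the 8-10 s sampling window with a margin.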

Additionally, RANS simulations with the Reynolds Stress Models are calculated 
in order to provide the inflow profiles for all flow rates of both the main and the 
branch pipe as well as for all different temperatures for the main inlet flow. These 
steady state simulations are very small compared to the LES and require ca. 100 
CPU hours each. 

In order to study the upstream flow conditions for main inlet temperatures of 
$T_1 > 170$ °C the high temperature T-junction case features a larger mesh with 
ca. 12 million nodes. These simulations are performed initially with adiabatic wall 
conditions. 

Furthermore, an LES with a refined mesh was performed, which was not discussed 
above, based on the thermally coupled T-junction flow simulation described in [7]. 
This mesh ensures a $y^+$ value of less than unity and contains ca. 19 million nodes. 
The total physical time for the fine grid solution is 30 s. On average, 12 nodes (96 
CPUs) were used for this large simulation. A detailed investigation of the 
grid-refinement study will be conducted, especially with respect to the heat transfer. The 
computational effort for this particular simulation is ca. $2.6 \times 10^4$ CPU hours. The 
initial values are interpolated from an instantaneous solution on a coarser grid. This 
initial simulation was run for more than 150 physical seconds to provide a near 
statistically steady condition for the evaluation of unsteady thermal interaction. 
Fig. 11 Scalability for the weld seam simulations of ca. five million cells. Each node features 8 
cores of which either 4 (red) or 8 cores (blue) are utilized. The speed-up (left) is based on the 
solution on 2 nodes. The average wall clock time per iteration is shown for 1,500 iterations 


Particularly, the steel pipe wall requires a very long time to establish a proper 
temperature field determined by the complex flow field, which cannot be achieved 
with steady state simulations. 


7 Conclusion 

Large Eddy Simulations of straight pipe flow, isothermal T-junction mixing flow and 
stratified T-junction flows influenced by a weld seam are performed. The LES data 
are validated against experimental data of a straight pipe flow and an isothermal, 
non-stratified T-junction mixing case. 

A new experimental setup for the investigation of the influence of the weld seam on the flow 
field is introduced. PIV measurements and the so-called 
Near-Wall-LED-Induced-Fluorescence technique, designed for highly pressurized pipe flow experiments 
with large temperature differences, are presented. 

The overall agreement between LES, using a high quality mesh, and the 
experiment for the isothermal cases is good. The similarity of the profiles in 
regard to the volumetric flow rates has been shown both in the simulation and 
the experiments. However, the LES with the dynamic Smagorinsky model slightly 
underestimates the peaks in the RMS, induced by the weld seam. 

The influence on the near-wall scalar showed that LES is capable of reproducing 
the complex flow condition of wavy stable stratification disturbed by a weld seam. It 
includes flow separation and the sudden rotational shift of the stratification layer. This 
leads to narrow regions with enhanced near-wall scalar fluctuations close to the 
weld seam, which is consistent with characteristic thermal fatigue patterns in pipe 
configurations. 
This effort is part of an ongoing project in which the experimental conditions 
and measurement technique will be refined and, eventually, the methods will be 
employed in the hot experiment in order to provide data for CFD validation. 
Simulation of Compressible Viscous Flow 
with an Immersed Boundary Method 


B. Jastrow and F. Magagnato 


Abstract An immersed boundary method combined with a wall-layer approach 
has been implemented into an established flow solver. In the outer flow field, the 
compressible Navier-Stokes equations are solved using an approximate Riemann 
Solver whereas simplified boundary-layer equations are solved near the wall. 
Turbulence is accounted for by the one-equation model of Spalart-Allmaras in the 
outer flow region and by a mixing length eddy viscosity model with near wall 
damping in the wall layer. Computations performed for various test cases show good 
agreement with reference data found in literature. 


1 Introduction 

Block-structured, body-fitted grid generation for simulating flow in complex 
geometry can be very tedious. The Cartesian-grid immersed boundary method (IBM) 
offers an interesting approach since automatic mesh generation can be realized 
easily. Furthermore, due to smoothness and orthogonality, Cartesian grids offer 
high accuracy and efficiency. In the IBM a complex geometry is immersed into 
a regular Cartesian grid. The effect of the body on the flow is mimicked by the 
imposition of proper boundary conditions that act as forcing conditions [1, 2]. 
Several applications of the IBM use a linear interpolation for setting the boundary 
conditions as proposed in [1]. Their validity is therefore only given for grids 
resolved down to $\Delta y^+ < 1$. For flows at high Reynolds numbers this restriction 
cannot be met within acceptable computing time. With the use of a 
wall-layer model one can overcome the need for high near wall resolution. Wall 
models based on incompressible turbulent boundary layer equations were proposed 
and tested by Balaras et al. [3], Cabot and Moin [4], Wang and Moin [5] in a 
body-fitted context. The applicability of such a wall-layer model in the framework 
of the IBM has been studied by Tessicini et al. [6]. Bond et al. [7] developed a 
compressible wall model, named the diffusion model. The present report studies the 
compressible wall-layer model within the framework of the IBM. The consideration 
of the energy equation allows a temperature distribution in the wall layer which in 
turn influences the thermodynamic quantities. It is shown that the implementation 
in the in-house flow solver SPARC (Structured Parallel Research Code) [8], in the 
following referred to as SPARC-IBM, provides results for basic test cases in good 
agreement with literature. 


2 Mesh Generation 

The mesh is automatically generated via a ray tracing technique. Based on a 
geometry described by a closed surface triangulation every cell center of a uniform 
Cartesian grid is marked as internal or external [9]. Those internal cells having at 
least one external neighbor cell are marked as wall-layer cells. From the latter the 
normal to the closest wall is computed and stored together with the forcing point 
and the appropriate interpolation neighbors (Fig. 1). 
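
The marking step described above can be illustrated in two dimensions, where the closed surface triangulation reduces to a closed polygon. The sketch below is not the SPARC-IBM implementation: it assumes that "internal" means inside the closed surface, uses a simple ray casting test in the +x direction, and only considers the four face neighbors. All names are hypothetical.

```python
def inside(point, polygon):
    """Ray casting: count crossings of a ray cast from the point in +x direction."""
    x, y = point
    crossings = 0
    for (x0, y0), (x1, y1) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y0 > y) != (y1 > y):                       # edge straddles the ray height
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x_cross > x:
                crossings += 1
    return crossings % 2 == 1

def classify(polygon, nx, ny, h):
    """Label cell centers: 'x' external, 'o' internal, 'w' wall-layer."""
    body = [[inside(((i + 0.5) * h, (j + 0.5) * h), polygon)
             for i in range(nx)] for j in range(ny)]
    labels = [['o' if body[j][i] else 'x' for i in range(nx)]
              for j in range(ny)]
    for j in range(ny):
        for i in range(nx):
            if not body[j][i]:
                continue
            # internal cell with at least one external face neighbor
            for dj, di in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                jj, ii = j + dj, i + di
                if 0 <= jj < ny and 0 <= ii < nx and not body[jj][ii]:
                    labels[j][i] = 'w'
                    break
    return labels

square = [(1.0, 1.0), (4.0, 1.0), (4.0, 4.0), (1.0, 4.0)]
grid = classify(square, 5, 5, 1.0)
for row in grid:
    print(''.join(row))
```

In three dimensions the same parity argument applies, with ray-triangle intersections replacing the edge crossings.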


3 Numerical Method 

The Reynolds-averaged Navier-Stokes (RANS) equations for compressible flow 
together with the Spalart-Allmaras turbulence model are used for the calculations 
presented in this report. The discretization of the convective terms is obtained by an 
approximate Riemann solver (HLLC) [10] whereas the viscous terms are discretized 
with a central difference scheme. The solution vector is updated via a Runge-Kutta 
scheme. 

Inside the wall-layer a simplified set of equations is solved on an embedded grid 
with a minimum resolution of 40 cells. It is assumed that convection is negligible 
compared to diffusion, the normal pressure gradient is zero and the streamwise 
gradients are of orders lower than the gradients normal to the wall. Furthermore, 
the normal velocity is insignificant compared to the tangential velocity. This leads 
to the following equations for the x-momentum and the energy: 


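$$\frac{\partial(\rho u)}{\partial t} = \frac{\partial}{\partial y}\left[(\mu + \mu_t)\frac{\partial u}{\partial y}\right] \qquad (1)$$

$$\frac{\partial(\rho e_t)}{\partial t} = \frac{\partial}{\partial y}\left[u(\mu + \mu_t)\frac{\partial u}{\partial y} + (\kappa + \kappa_t)\frac{\partial T}{\partial y}\right] \qquad (2)$$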
Fig. 1 Mesh in the near wall region: ○: internal cell, △: wall-layer cell, 
×: external cell, f: forcing point, i: interpolation neighbors 



The properties $\mu_t$ and $\kappa_t$ are the turbulent viscosity and the turbulent conductivity 
respectively. In a first stage of implementation, $\mu_t$ is obtained by a mixing length 
eddy viscosity model with near wall damping 

$$\frac{\mu_t}{\mu} = k y^+ \left(1 - e^{-y^+/A}\right)^2 \qquad (3)$$

where $k = 0.4$, $A = 19$ and $y^+$ defines the dimensionless distance to the wall [5]. 
In the case of laminar flow, the turbulent viscosity is set to zero. At the wall, a 
no-slip boundary condition is applied. At the forcing point, that defines the outer 
edge of the wall-layer, the boundary condition is obtained through an interpolation 
from the flow values of the outer flow field. The equations are further simplified 
by neglecting the time derivative terms and the streamwise pressure gradient. The 
wall shear stress needed for the calculation of y + is evaluated from the velocity 
gradient at the wall. Two tridiagonal systems are created for the velocity and the 
temperature respectively and solved in a segregated way with the other dependent 
variables held constant. Subsequently the turbulent viscosity is updated. Since the 
momentum equation and the energy equation are coupled via the viscosity and the 
velocity, an outer loop runs until the full solution converges. Finally the flow values 
for the position of the center of the wall-layer cell are extracted and provided to the 
outer flow field as boundary conditions. Assuming steady-state in the wall-layer 
corresponds to an instantaneous response of the wall-layer to the outer flow field 
which creates some error for unsteady flow calculations. 
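
The solution procedure for the wall layer can be sketched for the momentum equation alone. The example below is a simplified illustration and not the SPARC-IBM code: it drops the energy equation, assumes constant density and molecular viscosity, uses a uniform embedded grid, and evaluates the wall shear stress from the flux at the first cell face. It solves the steady equation with a tridiagonal (Thomas) solver and updates the mixing length viscosity of Eq. (3) in an outer loop.

```python
import math

def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def wall_layer_velocity(u_edge, height, rho, mu, turbulent=True,
                        n_cells=40, k=0.4, a_plus=19.0, tol=1e-10):
    """Steady d/dy[(mu + mu_t) du/dy] = 0 with u(0) = 0 and u(height) = u_edge."""
    dy = height / n_cells
    y_face = [(i + 0.5) * dy for i in range(n_cells)]   # faces between nodes
    mu_t = [0.0] * n_cells
    tau_w = 0.0
    n = n_cells + 1
    for _ in range(500):                                 # outer coupling loop
        mu_f = [mu + m for m in mu_t]
        lower, diag, upper, rhs = [0.0] * n, [1.0] * n, [0.0] * n, [0.0] * n
        rhs[-1] = u_edge                                 # value at the forcing point
        for i in range(1, n_cells):
            lower[i], upper[i] = -mu_f[i - 1], -mu_f[i]
            diag[i] = mu_f[i - 1] + mu_f[i]
        u = thomas(lower, diag, upper, rhs)
        tau_new = mu_f[0] * (u[1] - u[0]) / dy           # shear stress at first face
        converged = abs(tau_new - tau_w) <= tol * abs(tau_new)
        tau_w = tau_new
        if converged or not turbulent:
            break
        u_tau = math.sqrt(tau_w / rho)
        for i, y in enumerate(y_face):                   # update mu_t from Eq. (3)
            y_plus = rho * u_tau * y / mu
            mu_t[i] = mu * k * y_plus * (1.0 - math.exp(-y_plus / a_plus)) ** 2
    return u, tau_w

# ASSUMED sample values (air-like fluid, 1 mm wall layer, 10 m/s at the edge)
u_lam, tau_lam = wall_layer_velocity(10.0, 1.0e-3, 1.2, 1.8e-5, turbulent=False)
u_turb, tau_turb = wall_layer_velocity(10.0, 1.0e-3, 1.2, 1.8e-5)
print(tau_lam, tau_turb)
```

In the laminar case the profile is linear, whereas the added eddy viscosity in the turbulent case steepens the near-wall gradient and raises the wall shear stress.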


4 Results 

4.1 Flow Past a Flat Plate 

The first test case is the flow past a flat plate. The wall-layer model is capable of 
capturing both the linear and the turbulent near-wall behavior. The computation 
was carried out on a structured multi-block mesh with refinement. The boundary 
layer was resolved with 32 cells, making a total of 54,000 cells in the 2D plane, for 
the laminar case and with 27 cells, making a total of 123,000 cells, for the 
turbulent case. The Mach number was set to $Ma_\infty = 0.3$. The 
laminar velocity profile was extracted at a Reynolds number of $Re_x$ = 10,000. In 
Fig. 2 the profile is compared with the analytical results of the Blasius equation [11], 
where $f' = u/U_\infty$ is the dimensionless velocity and $\eta = y\sqrt{U_\infty/(\nu x)}$ is the dimensionless 
wall distance. The numerical results show very good agreement with the analytic 
solution of Blasius. 

Fig. 2 Comparison of the velocity profile in the boundary layer at $Re_x$ = 10,000 

Fig. 3 Comparison of the velocity profile in the boundary layer at $Re_x$ = 10,000,000 

Figure 3 shows the obtained velocity profile for $Re_x$ = 10,000,000 together 
with the experimental data of [12]. The agreement for the turbulent calculation is 
quite satisfactory as well. 
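
The Blasius reference profile used for the laminar comparison can be reproduced with a short shooting method. The sketch below assumes the similarity variable $\eta = y\sqrt{U_\infty/(\nu x)}$, for which the Blasius equation reads $2f''' + f f'' = 0$ with $f(0) = f'(0) = 0$ and $f'(\infty) = 1$. The integration limit, step count and bisection bounds are choices made for this illustration.

```python
def blasius_fp_end(fpp0, eta_max=10.0, steps=2000):
    """Integrate 2 f''' + f f'' = 0 with RK4 and return f'(eta_max)."""
    def rhs(state):
        f, fp, fpp = state
        return (fp, fpp, -0.5 * f * fpp)
    h = eta_max / steps
    state = (0.0, 0.0, fpp0)                 # f(0), f'(0), f''(0)
    for _ in range(steps):
        k1 = rhs(state)
        k2 = rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k1)))
        k3 = rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k2)))
        k4 = rhs(tuple(s + h * k for s, k in zip(state, k3)))
        state = tuple(s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                      for s, a, b, c, d in zip(state, k1, k2, k3, k4))
    return state[1]

def blasius_wall_curvature(lo=0.1, hi=1.0, iterations=40):
    """Bisection on f''(0) so that f'(eta_max) approaches 1."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if blasius_fp_end(mid) > 1.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

fpp_wall = blasius_wall_curvature()
print(fpp_wall)
```

The wall value $f''(0)$ obtained this way fixes the whole profile $f'(\eta) = u/U_\infty$, which can then be compared with the computed velocity at any station $x$.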
Fig. 4 Streamlines for Re = 20 

Table 1 Separation length L, separation angle $\theta_s$ and drag coefficient $C_D$ for Re = 20 

                          L       $\theta_s$   $C_D$
Sucker and Brauer [13]    0.83    43.3         2.02
Dennis and Chang [14]     0.94    43.7         2.05
SPARC-IBM                 0.93    41.8         2.09

Fig. 5 Instantaneous contours of the vorticity for Re = 200 


4.2 Laminar Flow Past a Circular Cylinder 

To test the IBM together with curved surfaces, the flow past a circular cylinder was 
computed. Two flow regimes are shown, a steady flow at Re = 20 and an unsteady 
flow at Re = 200. The mesh consists of 105,000 cells in the 2D plane with the 
near-wall and the wake region refined. The free stream Mach number in both cases 
was set to $Ma_\infty = 0.3$. The streamlines for the flow at Re = 20 are shown in 
Fig. 4, whereas Table 1 lists the separation length and angle and the drag coefficient 
in comparison with reference values from literature [13, 14]. The agreement is quite 
satisfactory; only the separation angle is predicted too low, which might be due to 
the pressure gradient not being considered in the wall-layer. 

Figure 5 shows the instantaneous contours of the vorticity for an unsteady 
laminar flow around the cylinder at Re = 200. A comparison of the Strouhal number 
and the mean drag coefficient with results from literature [15, 16] is provided in 
Table 2. 

Table 2 Strouhal number $St$ and mean drag coefficient $C_D$ for Re = 200 

                          $St$      $C_D$ 
Linnick and Fasel [15]    0.197     1.34 
Liu et al. [16]           0.192     1.31 
SPARC-IBM                 0.193     1.38 


5 Computational Efficiency 

In computing these results we used up to 1,344 AMD Interlagos 
cores of the CRAY XE6 (Hermit) at HLRS in Stuttgart. The in-house 
code SPARC is parallelized with MPI-2. The computational time for 
one unsteady calculation in three dimensions using about 10 million points was about 
24 h. Since we were using more than 10,000 blocks of the finite volume scheme, 
we could efficiently distribute the blocks over these 1,344 cores with the domain 
decomposition technique. The load balancing was at about 99 %. Since the communication is 
done with the CRAY GEMINI node-to-node interconnect, the parallel efficiency was 
close to 95 %. This very good parallel efficiency could only be obtained because we 
reduced the amount of output to a minimum. Since more output data is desired, 
the read/write performance of the code needs to be improved 
in the near future. From our recent investigations we know that a 
higher resolution of the computational mesh is required. We think that using about 
80 million points in the next phase will be adequate for a well resolved unsteady 
calculation. 
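How many small blocks can be spread evenly over a fixed number of cores can be illustrated with a short sketch. The code below is not taken from SPARC: it assumes a greedy "largest block first" assignment, hypothetical block sizes (10,000 blocks adding up to roughly 10 million points), and a load-balance measure defined here as mean core load divided by maximum core load. The text does not state how its 99 % figure is defined, so this definition is an assumption.

```python
import heapq


def distribute_blocks(block_sizes, n_cores):
    """Greedy assignment: always give the next-largest block to the least-loaded core."""
    heap = [(0, core) for core in range(n_cores)]  # (current load, core index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_cores)]
    order = sorted(range(len(block_sizes)), key=lambda b: -block_sizes[b])
    for block_id in order:
        load, core = heapq.heappop(heap)
        assignment[core].append(block_id)
        heapq.heappush(heap, (load + block_sizes[block_id], core))
    loads = [0] * n_cores
    for load, core in heap:
        loads[core] = load
    return assignment, loads


def load_balance(loads):
    """Assumed measure: mean load over maximum load (1.0 means perfectly balanced)."""
    return (sum(loads) / len(loads)) / max(loads)


# Hypothetical block sizes between 500 and 1,499 points (about 10 million in total).
block_sizes = [500 + (37 * i) % 1000 for i in range(10_000)]
assignment, loads = distribute_blocks(block_sizes, 1344)
print(f"total points: {sum(block_sizes)}, load balance: {load_balance(loads):.3f}")
```

With many more blocks than cores, a simple heuristic of this kind already yields a well-balanced distribution, which is the effect the paragraph above relies on.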


6 Conclusion and Ongoing Work 

The immersed boundary method (IBM) with a wall-layer approach has been 
implemented into an established block-structured code for solving compressible 
flow. The computation of basic test cases for laminar flow showed very good 
agreement with results found in literature. Viscosity dominated turbulent flow like 
the flow past a flat plate was also captured well. To overcome the lack of consistency, 
future work comprises the testing of the simplified Spalart-Allmaras turbulence 
model for the wall-layer. It is believed that in contrast to using the algebraic model, 
the full coupling of the outer flow field and the wall-layer provides better results for 
turbulent flows. Furthermore, the streamwise pressure gradient in the wall-layer and 
a flow-adapted mesh refinement are about to be implemented. Since the advantages 
of the IBM are only given for the simulation of complex geometries where the time 
for mesh generation is significant, the next step is to validate the implementation for 
3D cases. An example of a complex geometry is given in Fig. 6. 

Fig. 6 Preliminary solution of the flow field around a complex geometry 


Numerical and Experimental Examination 
of Shock Control Bump Flow Physics 


K. Nübler, S.P. Colliss, T. Lutz, H. Babinsky, and E. Krämer 


Abstract A method allowing a detailed investigation of the flow physics of shock 
control bumps (SCBs) on an unswept airfoil has been developed by comparison 
of the results of experiments and computations. A simple wind tunnel set-up is 
proposed which is shown to generate representative baseline conditions, allowing 
fine details of the flow to be measured using an array of techniques. Computational 
data for the same bump configuration is then validated against the experimental 
results, allowing a more intimate analysis of the flow physics as well as relating 
wind tunnel results to the performance of the SCB on an unswept wing. 


1 Introduction 

Shock control bumps (SCBs) are a promising method to reduce wave drag on 
transonic wings by inducing a bifurcated shock structure and thus decelerating the 
flow more isentropically [1]. However, the interaction between the modified surface 
contour, the wing boundary layer and the system of shocks has been shown to 
lead to a complex flow field [4, 12-14]. In order to find improved SCB geometries, 
specifically those which are more robust towards changing free stream conditions, 
the flow physics need to be better understood [3]. To pursue this target, a joint study 
is in progress analysing the bump flow both in a supersonic wind tunnel and using 
RANS-based CFD. 

CFD can simulate the whole wing with bumps [2], but may struggle to resolve 
the small-scale flow physics, and the separation prediction requires validation. 
However, once a reliable CFD setup is found, a quick analysis with variable 
free-stream conditions is possible, allowing shape optimisations or other studies with a 
high number of required cycles. 

Fig. 1 Region of interest for SCB experiments: (a) modelled by wind tunnel experiments; (b) on a typical airfoil 

Fig. 2 Comparison of typical pressure gradients on airfoils and in wind tunnels in the absence of a bump 

For experimental investigations, a blow-down type supersonic tunnel is preferred 
as runs are relatively inexpensive and repeatable [4]. However, in this method 
only the flow local to the bump is considered and the boundary 
layer is the one grown naturally through the wind tunnel. Although this is a proven 
technique for ensuring that the Reynolds number based on displacement thickness is 
sufficiently high, certain other considerations bring the validity of such experiments 
into question. These are mainly due to the changing curvature of the nozzle in 
the wind tunnel, which contrasts with the continuous convex surface of the airfoil, 
shown in Fig. 1. 

This is expected to not only have an impact on the development of the boundary 
layer, but also modify the surface pressure distribution that the SCB is subjected to. 
A comparison of typical pressure distributions for each situation (Fig. 2) indicates 
that the post-shock adverse pressure gradient is higher for the airfoil than the wind 
tunnel. It has been suggested that the flow here is of high importance for the bump 
performance [15], and thus the question of validity will be addressed in the present 
work. 


2 Methods 

2.1 Numerical Setup 

Solver 

Numerical analyses are performed using FLOWer [7], a well-established structured 
and density-based RANS solver developed by the German Aerospace Centre (DLR). 
The two-equation Shear Stress Transport $k$-$\omega$ [6] turbulence model was used. Except 
for the finest grid in the grid convergence study, no unsteady computations were 
necessary within this work. 

Governing Equations 

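FLOWer solves the three-dimensional Reynolds-averaged Navier-Stokes (RANS) equations 
in integral form 

\[ \frac{\partial}{\partial t} \int_V W \, dV + \oint_{\partial V} F \cdot n \, dS = 0 \tag{1} \]

with the vector of the conservative variables 

\[ W = [\rho, \rho u, \rho v, \rho w, \rho E]^T \tag{2} \]

on block-structured finite volume meshes. The conservative variables are given in 
a Cartesian coordinate system with $\rho$, $u$, $v$, $w$, $E$ denoting the density, the Cartesian 
velocity components of the velocity vector $\mathbf{v}$ and the specific total energy, respectively. 
$V$ represents the control volume and $\partial V$ its closed outer surface. The flux 
tensor $F$ is divided into a convective inviscid part $F^c$ and a viscous part $F^v$ such that 

\[ F = F^c - F^v \tag{3} \]

with 

\[
F^c = \begin{bmatrix}
\rho u & \rho v & \rho w \\
\rho u^2 + p & \rho u v & \rho u w \\
\rho u v & \rho v^2 + p & \rho v w \\
\rho u w & \rho v w & \rho w^2 + p \\
\rho u E + u p & \rho v E + v p & \rho w E + w p
\end{bmatrix}
\quad \text{and} \quad
F^v = \begin{bmatrix}
0 & 0 & 0 \\
\sigma_{xx} & \sigma_{xy} & \sigma_{xz} \\
\sigma_{yx} & \sigma_{yy} & \sigma_{yz} \\
\sigma_{zx} & \sigma_{zy} & \sigma_{zz} \\
\psi_x & \psi_y & \psi_z
\end{bmatrix}
\tag{4}
\]

The $\psi_i$ are abbreviations of the type 

\[ \psi_i = u \sigma_{ix} + v \sigma_{iy} + w \sigma_{iz} + K \frac{\partial T}{\partial i} \quad \text{for } i = x, y, z \tag{5} \]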








The pressure $p$ is calculated by the equation of state of the perfect gas 

\[ p = (\gamma - 1) \rho \left( E - \frac{u^2 + v^2 + w^2}{2} \right) \tag{6} \]

with the specific heats ratio $\gamma$; the temperature $T$ follows from the corresponding 
thermal equation of state. 
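Relation (6) links the conservative variables to the pressure. A minimal sketch of this conversion and its inverse is given below; the function names are illustrative and not part of FLOWer, and the value γ = 1.4 (air) is an assumption.

```python
def pressure_from_conservative(W, gamma=1.4):
    """Pressure from W = (rho, rho*u, rho*v, rho*w, rho*E) using relation (6)."""
    rho, rho_u, rho_v, rho_w, rho_E = W
    u, v, w = rho_u / rho, rho_v / rho, rho_w / rho
    E = rho_E / rho  # specific total energy
    return (gamma - 1.0) * rho * (E - 0.5 * (u * u + v * v + w * w))


def conservative_from_primitive(rho, u, v, w, p, gamma=1.4):
    """Inverse relation: build W from density, velocity components and pressure."""
    E = p / ((gamma - 1.0) * rho) + 0.5 * (u * u + v * v + w * w)
    return (rho, rho * u, rho * v, rho * w, rho * E)


# Round trip for an illustrative state (SI units, values chosen arbitrarily).
W = conservative_from_primitive(rho=1.2, u=100.0, v=5.0, w=0.0, p=101325.0)
print(pressure_from_conservative(W))
```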


The Reynolds stress tensor $\sigma_{ij}$ in (4) and (5), which represents correlations between 
fluctuating velocities and depends on the fluid viscosity, is given by 

\[ \sigma_{ij} = \mu \left( v_{i,j} + v_{j,i} - \frac{2}{3} \delta_{ij} v_{k,k} \right) \tag{8} \]

where $(\cdot)_{i,j}$ denotes the derivative of the $i$-th component with respect to $x_j$. Likewise, 
the heat conductivity $K$ in (5) depends on the viscosity $\mu$, as shown by the relation 

\[ K = \frac{\gamma}{\gamma - 1} \frac{\mu}{Pr} \tag{9} \]


with $Pr$ being the Prandtl number. For laminar flow, $\mu$ in (8) and (9) is set to $\mu = \mu_l$, 
which follows the Sutherland law 

\[ \mu_l = \mu_0 \left( \frac{T}{T_\infty} \right)^{3/2} \frac{T_\infty + S}{T + S}, \tag{10} \]

where $\mu_0 = 1.716 \times 10^{-5}$ kg/(m s) and $S = 110.4$ K. 
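The Sutherland law (10) is easy to evaluate directly. The sketch below uses the constants quoted in the text; the reference temperature is left as a parameter, and its default of 273.15 K is an assumption (the common textbook pairing with μ0 = 1.716 × 10⁻⁵ kg/(m s) for air), whereas the text references the free-stream temperature.

```python
def sutherland_viscosity(T, T_ref=273.15, mu_ref=1.716e-5, S=110.4):
    """Dynamic viscosity in kg/(m s) from the Sutherland law.

    T, T_ref and S are in kelvin. T_ref = 273.15 K is an assumed default;
    pass the free-stream temperature to mirror the form used in the text.
    """
    return mu_ref * (T / T_ref) ** 1.5 * (T_ref + S) / (T + S)


for T in (250.0, 273.15, 300.0):
    print(f"T = {T:7.2f} K  ->  mu = {sutherland_viscosity(T):.4e} kg/(m s)")
```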


Meshing and Acceleration 

The airfoil chosen for this investigation is ‘pathfinder’, a low camber transonic 
airfoil designed for research in the field of natural laminar flow [8]. The structured 
meshing is script-based, ensuring consistently high-quality meshes for each SCB 
configuration and thus reducing the grid influence on the solution to a minimum. All 
calculations performed within this work (except for the grid convergence study) 
were carried out on a C-grid with 5.8 million cells in total. This consists of 480 cells 
around the airfoil, of which 130 are in the wake, and 100 cells in the wall-normal 
direction. The far-field extends to 50 chord lengths from the airfoil, and the grid 
is refined in the boundary layer so that the nearest cell to the wall corresponds 
to $Y^+ \approx 1$. Different widths of the computational domain, from 0.1 to 0.3 c, were 
examined for comparisons between experimental and computational results. The 
number of spanwise cells was 120, which was found to be suitable for the highest 
span. Periodic boundary conditions are employed so that the set-up represents an 
unswept wing of infinite aspect ratio. A single SCB is applied at centre-span, such 
that the effective bump spacing is set by the width of the computational domain. 

For calculations aiming for a target lift value, an angle-of-attack update is 
performed every 1,000 iterations. Figure 3 shows such a case with a target lift value 
of $C_L = 0.405$. The peaks in residual at 2,000 and 5,000 iterations are caused 
by switching to a finer mesh; this three-stage multi-grid approach was adopted to 
accelerate convergence. Convergence is then continued up to a residual of $10^{-6}$, or 
up to 9,000 iterations if this target is missed. 


Grid Convergence Study 

A grid convergence study was conducted to find a good compromise between 
computational cost and simulation accuracy. The cell distribution as described in the 
section above represents level 1 as shown in Table 1. The number of cells in each 
spatial direction was doubled or halved, respectively, to get the next level. The $Y^+$ 
value was kept constant for all meshes at about 1 for $Re = 20 \times 10^6$. The differing 
number of cells required in the wall normal direction was achieved through varying 
growth rates. 
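The refinement rule can be checked with a short calculation. The sketch below assumes the level 1 cell counts quoted in the meshing description (480 × 100 × 120) and an exact factor of two per direction between levels; the function name is illustrative. The resulting totals only approximate the values in Table 1, since the actual meshes need not follow the factor of two exactly.

```python
BASE_CELLS = (480, 100, 120)  # level 1: around the airfoil, wall-normal, spanwise


def cells_at_level(level, base=BASE_CELLS, base_level=1):
    """Total cell count when every direction is doubled (finer) or halved (coarser)."""
    factor = 2.0 ** (base_level - level)  # level 0 is finer, levels 2 and 3 are coarser
    total = 1.0
    for n in base:
        total *= n * factor
    return total


for level in range(4):
    print(f"level {level}: about {cells_at_level(level) / 1e6:.2f} million cells")
```

Each level therefore differs from its neighbour by a factor of eight in total cell count.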

A contour bump was applied to a pathfinder airfoil section of width 0.3 c; 
the exact details of placement are described below (see also [2]). This case is 
considered critical because the flow is very close to separation after the normal 
shock on the bump crest. Moreover it was considered a good test case to qualitatively 
assess the prediction of flow features on the bump. For level 0, URANS with 
100 inner iterations was required for convergence reasons resulting from the weak 
numerical damping. A quasi-steady solution with good convergence and coefficients 
comparable with the steady case was established. 

The aerodynamic coefficients in Table 2 show that convergence in drag 
prediction is achieved for level 1 and only a small discrepancy remains with respect 
to lift, whilst for levels 2 and 3 similar results are not achieved. The friction 
coefficient distribution shown in Fig. 4 provides further evidence that level 1 provides 
sufficient accuracy, and a similar conclusion could be drawn from the pressure 
distributions (not shown). The level 1 mesh is therefore chosen for further SCB 
investigations as a good compromise between accuracy and computational cost. 
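The percentage deviations relative to level 0 can be recomputed from the tabulated coefficients. The sketch below uses the rounded values of $C_L$ and $C_D$ as printed in Table 2, so its results differ slightly from the tabulated deviations, which were presumably computed from unrounded coefficients (an assumption). The names are illustrative.

```python
# (C_L, C_D) per refinement level, rounded values as printed in Table 2.
COEFFICIENTS = {
    0: (0.424, 0.0103),
    1: (0.418, 0.0103),
    2: (0.409, 0.0109),
    3: (0.364, 0.0164),
}


def percent_deviation(value, reference):
    """Relative deviation of value from reference in percent."""
    return 100.0 * (value - reference) / reference


def deviations(level, table=COEFFICIENTS, reference_level=0):
    """Deviations of C_L, C_D and L/D at the given level relative to the reference level."""
    cl, cd = table[level]
    cl0, cd0 = table[reference_level]
    return (
        percent_deviation(cl, cl0),
        percent_deviation(cd, cd0),
        percent_deviation(cl / cd, cl0 / cd0),
    )


for level in (1, 2, 3):
    d_cl, d_cd, d_ld = deviations(level)
    print(f"level {level}: dC_L = {d_cl:+.2f} %, dC_D = {d_cd:+.2f} %, dL/D = {d_ld:+.2f} %")
```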


Fig. 3 Typical solution convergence history for presented setup with target lift on 

Table 1 Grid size on levels for convergence study 

Level    Approx. number of cells (million) 
0        46.0 
1        5.78 
2        0.75 
3        0.10 

Table 2 Computed aerodynamic coefficients relative to level 0 (percentage deviations) 

Level    $C_L$    $C_D$     $\Delta C_L$    $\Delta C_D$    $\Delta L/D$ 
0        0.424    0.0103    -               -               - 
1        0.418    0.0103    -1.45           -0.42           -1.03 
2        0.409    0.0109    -3.62           5.83            -8.93 
3        0.364    0.0164    -14.3           58.58           -45.96 


2.2 Wind Tunnel Setup 

Experiments were performed in the No. 1 supersonic wind tunnel at Cambridge 
University Engineering Department, a blow-down tunnel with a rectangular working 
section 114 mm wide by 178 mm high. The stagnation pressure is controlled by a 
manually operated valve with accuracy typically better than 0.5 %, and run times 
of up to 30 s are achievable. The nozzle geometry is fixed and set up to operate 
nominally at M = 1.3. The tunnel is equipped with an ejector system to enable a 
degree of control over the incoming boundary layer properties via either localised 
or distributed suction. A steady normal shock is positioned by a shock holding plate 
(following [4]) with accuracy of typically ±1 mm. An additional post-shock adverse 
pressure gradient is imposed using a subsonic diffuser, the angle of which was 
chosen to be 3° based on an (inviscid) estimate of the additional pressure gradient 
that this would impose, although this could be varied if it proved to be necessary. 
The wind tunnel geometry is shown in Fig. 5. 

The flow is studied using a range of experimental techniques. Two-component 
laser Doppler anemometry (LDA) is used to measure both streamwise and wall-normal 
velocity components to within typically ±0.5 %. The flow is seeded with 
olive oil introduced in the settling chamber via a centreline seeding 
rake. Measurements were made on one side of the tunnel and reflected about the 
centreline; the validity of doing this was checked periodically by performing 
symmetry checks. The surface pressure field is quantified to within ±5 % using 
pressure sensitive paint (PSP), which is calibrated in situ against readings taken 
from a number of static pressure tappings using a 16-channel NetScanner 9116 
series pressure transducer. Total pressure profiles were measured using a 15-hole 
Pitot rake, with an accuracy of approximately 2 %. This could be mounted at various 
spanwise locations in the tunnel. Additionally, total pressure profiles close to the 
bump were calculated from LDA and PSP data using the assumptions that the 
pressure and stagnation temperature are constant across the boundary layer. This 
method is accurate to within ±5 %. 

Fig. 4 $C_f$ over airfoil chord for different refinement levels 

Fig. 5 Experimental set-up: (a) photograph of wind tunnel, the dashed circle shows the region of optical access; (b) plan view of tunnel floor showing suction slot arrangement 
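The reconstruction of total pressure from velocity and surface pressure can be sketched as follows. This is an illustration under stated assumptions, not the authors' data reduction: it assumes a calorically perfect gas with γ = 1.4 and R = 287.05 J/(kg K), takes the wall static pressure and the stagnation temperature as constant across the boundary layer (as stated above), and applies the isentropic relation between static and total pressure. Function and parameter names are hypothetical.

```python
import math


def total_pressure(speed, p_static, T0, gamma=1.4, R=287.05):
    """Local Mach number and total pressure from a measured flow speed.

    speed    : velocity magnitude from LDA in m/s
    p_static : wall static pressure (e.g. from PSP), assumed constant across the layer
    T0       : stagnation temperature in K, assumed constant across the layer
    """
    cp = gamma * R / (gamma - 1.0)
    T = T0 - speed * speed / (2.0 * cp)          # static temperature from energy balance
    mach = speed / math.sqrt(gamma * R * T)      # local Mach number
    ratio = (1.0 + 0.5 * (gamma - 1.0) * mach * mach) ** (gamma / (gamma - 1.0))
    return mach, p_static * ratio


# Illustrative values only: stagnation temperature and static pressure are assumed.
mach, p0 = total_pressure(speed=380.0, p_static=60.0e3, T0=290.0)
print(f"M = {mach:.3f}, p0 = {p0 / 1e3:.1f} kPa")
```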






































Table 3 Free stream conditions for pathfinder airfoil meeting the wind tunnel flow properties 

$M$      $\alpha$    Transition position    Turb. model 
0.76     1.8°        10 % chord length      SST $k$-$\omega$ 


2.3 Adaptation of Methods 

In order to produce relevant validation data, both the experimental and computational 
set-ups were adapted to produce baseline (no control) conditions which 
agreed. Several parameters were varied in the computations: free-stream Mach 
number and angle of attack controlled the inviscid velocity in the region of interest, 
whilst the transition location determined the incoming boundary layer properties. To 
reduce computational cost, the coupled Euler-boundary layer code MSES [10] was 
used, with the best candidate solutions being fine-tuned using FLOWer as detailed 
in Sect. 2.1. The final values of the parameters are given in Table 3. 

Figure 6 shows a comparison of the incoming boundary layer profiles from 
experiment and CFD using the conditions outlined above, and the corresponding 
integral parameters are given in Table 4. These show good agreement, indicating 
that no modification to the wind tunnel boundary layer was required to achieve good 
agreement. 

The surface pressure distributions for experiment and CFD are presented in 
Fig. 7. Again, good agreement is observed, confirming the choice of diffuser angle 
in the wind tunnel set-up. It is noteworthy that these results have been achieved 
for a combination of airfoil parameters which are representative of conditions on a 
real wing. 


3 SCB Application: Validation and Results 
3.1 SCB Geometry 

A contour bump has been analysed using the methods described in the previous 
section. The longitudinal and lateral height profiles are given by elastic deflection 
formulae and shown in Fig. 8. Further details of the bump shape and discretisation 
may be found in [2]. The dimensions in Fig. 8 are scaled with the incoming boundary 
layer thickness $\delta_0$, which was measured as 6 mm according to Table 4. In absolute 
dimensions, the bump is 150 mm long and 50 mm wide, with a crest height of 6 mm. 
The CFD airfoil chord length was set to 1 m. 

Two shock positions were examined: ‘on-design’, where the shock is 75 mm from 
the nose, and ‘off-design’, where the shock is 20 mm further downstream. In the 
experiments this was achieved by moving the shock, whilst in CFD the bump was 
moved forwards on the airfoil. The presence of the SCB caused the shock to shift 
5 mm upstream on the airfoil and therefore the bump had to be moved 25 mm to 
match the relative shock position in the wind tunnel. A first analysis of the flow 
features can be found in [15]. 

Fig. 6 Boundary layer profile upstream of shock: CFD clean airfoil data and wind tunnel measurement with Sun and Childs [9] turbulent boundary profile fitted 

Table 4 Boundary layer parameters upstream of the shock in the experiment and in the computation 

        $P_0$ (kPa)    $M_{shock}$    $\delta_0$ (mm)    $\delta^*_0$ (mm)    $\theta_0$ (mm)    $H_0$    $Re_{\delta^*_0}$ 
Exp.    170            1.30           6.0                0.67                 0.53               1.28     27,650 
CFD     173            1.30           7.0                0.67                 0.51               1.31     25,670 

Fig. 7 Pressure distribution across the shock with diffuser and in CFD 

Fig. 8 Contour bump geometry in longitudinal (a) and lateral (b) plane 


3.2 Validation 

Figure 9 shows a comparison of computed surface streamlines and experimental 
surface flow visualisation results for both shock positions.¹ On-design, good 
agreement is observed between the experiment and CFD with only minor discrepancies, 
both with respect to the flow topology and the region of low shear stress over the 
SCB tail. The off-design separation topology, which may be classified as an 
‘owl-face of the first kind’ [16], is correctly predicted in position by CFD, although it is 
larger. This separation has been found to result from either positioning the bump too 
far forward on the wing (as was the case here) or from having too tall a bump. 

Further favourable comparison may be found in the surface pressure fields 
presented in Fig. 10. The general shape of the field is seen to be similar in both cases, 
with respect to the shock footprint as well as the development of a low pressure 
region just ahead of the main shock on the bump crest. The centreline distributions 
confirm the similarity in shape, although it is seen that the initial compression is 
lower for the airfoil than the experiment. This is possibly caused by the underlying 
curvature of the airfoil which means that whilst the uncontrolled shock strength 
(at the rear leg of the SCB shock system) is matched, ahead of the bump the flow 
has not yet reached $M_\infty = 1.3$. Additionally the airfoil curvature is thought to 
reduce the effective ramp angle, further weakening the initial shock. 

It was found that the width of the computational domain (which sets the effective 
bump spacing) had a noticeable impact on the level of agreement attained between 
experimental and computational results. The computational results presented in 
Fig. 9 were obtained for a domain width of 0.1 c, which is similar to the wind tunnel 
width. For a domain width of 0.3 c, the region of low shear visible on the tail in 
Fig. 9a was found to be a separation at the trailing edge. This suggests that there is 
an appreciable interaction between bumps if the spacing (centreline to centreline) 
is two bump widths or less, in this case allowing a more aggressive tail design than 
would otherwise be possible without inducing separation. 

1 The perspective view in (b) is used as the oil accumulated at the foci in the wind tunnel experiment 
was smeared by tunnel shutdown, obscuring the topology. This image was taken from a 
high-definition video recorded during the tunnel run. 

Fig. 9 Comparison of computed surface streamlines with experimental surface flow visualisation: 
(a) on-design, (b) off-design. The dashed line shows the shock position; in (a) the blue regions in 
the CFD image indicate low shear stress 

Fig. 10 Comparison of surface pressure fields obtained from experiment and CFD for on-design 
shock position: (a) centreline distribution; (b) experimental result obtained using pressure sensitive 
paint; (c) computational result 

A further point of note is that the separation in the off-design case is better 
predicted in this case than the corresponding results presented by Nübler et al. [11]. 
It is thought that this may have been influenced by the increased post-shock adverse 
pressure gradient imposed by the new wind tunnel set-up, which was not present for 
the experiments quoted in that paper. 

In summary, good agreement between computational and experimental results 
is achieved, which confirms that the method of adapting both wind tunnel and 
CFD to model the same flow but with different approaches is viable and produces 
satisfactory and useful results. 


3.3 Some Elements of SCB Physics 

The validated computational data can be used to improve understanding of flow 
structures found by wind tunnel studies for which certain parameters would be 
difficult to compute from the experimental data. The following section presents 
some preliminary results of an ongoing study on the effects of the tail geometry. 

Experimental measurements of total pressure in the wake are presented in Fig. 11. 
These show that just downstream of the bump, there is a high total pressure loss 
relative to the no control case, and thus using just these measurements (as has 
traditionally been done in experiments [5, 12]) it would be concluded that the SCB 
incurs a severe penalty in the boundary layer. However, further downstream (towards 
the end of the diffuser) the total pressure is seen to have recovered such that there is 
a net gain in the lower boundary layer. Using these results, the conclusion on the SCB 
viscous penalty would therefore be the opposite. 

Similar behaviour is evident in the computational solution. Figure 12 shows the 
development of the total pressure deficit, defined as 



where $A$ is the area spanning the height of the computational domain and the width 
of the bump. A region of clear improvement in the deficit (evidenced by a decreasing 
value of $\eta$) is seen which persists until around 100 mm downstream of the bump. 
Further downstream the deficit increases again, and this is thought to be due to 
the boundary layer growth in the increasing adverse pressure gradient towards the 
trailing edge of the wing. 

It is thought that this improvement is brought about by streamwise vorticity [15], the existence 
of which was postulated from velocity measurements in a plane 20 mm downstream 
of the SCB (reproduced in Fig. 13) which show a region of clear downwash. 
Experimental determination of vorticity was not possible due to limited optical 
access preventing measurement of the spanwise component of velocity. However, 
the presence of this structure is confirmed by the computational results, also shown 
in Fig. 13, where clear regions of vorticity are apparent. 

The vortical structure does not represent vortices in the traditional sense, but is a 
simple spanwise variation in the flow direction. That this effect can bring about the 
total pressure recovery is encouraging because, unlike with ‘traditional’ vortices, 
there is no corresponding upwash region and therefore the presence of a zone of 
increased total pressure deficit is minimised. 

Figure 14 shows how the magnitude of the vorticity varies in the streamwise 
direction via the circulation, defined as 

\[ \Gamma(x) = \int_{A(x)} \omega \cdot d^2x \]

The area $A(x)$ in this case is the semi-span of the computational domain. This 
is non-dimensionalised against the skin friction velocity $u_\tau$ ($= \sqrt{\tau_w/\rho}$) and $h$, the 
height of the bump, following a similar treatment by Ashill et al. [17] to enable 
the strength of the vorticity generated by the bump to be compared to that of vortex 
generators. 

Fig. 11 Total pressure measurements downstream of SCB in wind tunnel (legend: no control; SCB on-design): (a) just downstream of 
the bump at x = 170 mm; (b) far downstream at x = 485 mm, corresponding to the wing trailing 
edge 

Fig. 12 Total pressure deficit variation in SCB wake; note that x = 150 mm corresponds to the back of the bump 
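The circulation integral and its normalisation by $u_\tau h$ can be evaluated numerically from a sampled vorticity field. The sketch below assumes the streamwise vorticity is available on a uniform grid in a cross-flow plane and integrates it with the trapezoidal rule; the grid, the input values and all names are illustrative and do not come from the CFD solution discussed here.

```python
def circulation(omega, dy, dz):
    """Trapezoidal integral of streamwise vorticity over a uniform (y, z) grid.

    omega is a list of rows: omega[j][k] is the vorticity at (y_j, z_k).
    """
    ny, nz = len(omega), len(omega[0])
    total = 0.0
    for j in range(ny):
        wy = 0.5 if j in (0, ny - 1) else 1.0      # half weight on boundary rows
        for k in range(nz):
            wz = 0.5 if k in (0, nz - 1) else 1.0  # half weight on boundary columns
            total += wy * wz * omega[j][k]
    return total * dy * dz


def nondimensional_circulation(gamma_circ, tau_w, rho, h):
    """Circulation scaled by friction velocity u_tau = sqrt(tau_w / rho) and bump height h."""
    u_tau = (tau_w / rho) ** 0.5
    return gamma_circ / (u_tau * h)


# Illustrative check: uniform vorticity of 2 1/s over a 1 m x 1 m plane.
field = [[2.0] * 21 for _ in range(11)]
gamma_circ = circulation(field, dy=0.1, dz=0.05)
print(gamma_circ, nondimensional_circulation(gamma_circ, tau_w=4.0, rho=1.0, h=0.5))
```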

Two peaks of circulation are observed, which correspond to vorticity generated 
by the curved shock structure; this curvature is evident in the surface pressure results 
presented in Fig. 10. These represent the highest vorticity found anywhere in the 
flow, although they are seen to be relatively transient, decaying quickly downstream. 
A particularly interesting feature is the increase in circulation between x = 100 and 
225 mm, suggesting that the vorticity observed in the wake which is responsible for 
the total pressure recovery is generated on the bump tail, confirming the suggestion 
by Colliss et al. [15]. 

Fig. 13 Wake structure in plane 20 mm downstream of SCB for the on-design shock 
position. The velocities are LDA measurements made in the wind tunnel; streamwise 
vorticity is calculated from the CFD solution 

Fig. 14 Distribution of circulation generated by SCB, calculated from CFD data 

The peak wake circulation generated is seen to be $\Gamma/(u_\tau h) \approx 4$. This is very 
weak compared to the typical vorticity generated by a vortex generator for devices 
of this height (where $\Gamma/(u_\tau h)$ is approximately 16 for VGs run subsonically [17]). 
This is perhaps not surprising, since the SCB is not explicitly designed to generate 
strong vortices. However, the fact that this flow structure brings about measurable 
benefits is encouraging, suggesting the SCB’s viability as a potential method to 
control the boundary layer on the aft sections of wings. This remains subject to 
further research. 


4 Computational Resources 


The FLOWer code is very suitable for vector computing. It was, however, decided 
to use the HLRS NEC Nehalem and Cray XE6 clusters for this study, although 
good experience with the NEC SX machines was gained within this project earlier. 
The reasons are the high availability of the clusters and cost-efficient and flexible 
computing due to the lower CPU rates on the clusters. 

Computation details and parallelisation performance are shown in Table 5 for a 
level 1 grid as described in Sect. 2.1 (Meshing and Acceleration). These computations 
were conducted on the NEC Nehalem cluster. As a structured mesh is used, 
reasonable block splitting is necessary to balance the load of the different CPUs. 
Two different splitting levels are compared, showing a parallelisation loss of around 
7.2 % for the more finely split mesh. Note that both meshes had a similar balancing 
ratio of 1.089, where 

\[ \text{balancing ratio} = \frac{\text{point number of biggest block}}{\text{point number of total mesh} / \text{number of CPUs}} \]

The lower number of iterations for the more finely split mesh was found 
to be sufficient for satisfactory convergence of the computation in the considered case 
and is independent of the block splitting. 

Table 5 Grid size on levels for convergence study 

Mesh size    CPUs    Nodes    Iterations    Job TaT [h]    Iter/(CPU h) 
5.78 mio.    7       1        4,000         12.25          46.6 
5.78 mio.    28      4        3,250         2.67           43.5 
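The last column of Table 5 and the balancing ratio can be recomputed in a few lines. The sketch below takes the iteration counts, CPU counts and job turnaround times from Table 5; the block list passed to `balancing_ratio` is a made-up example, since the actual block sizes are not given, and all names are illustrative.

```python
def balancing_ratio(block_points, n_cpus):
    """Points in the biggest block divided by the ideal share per CPU."""
    return max(block_points) / (sum(block_points) / n_cpus)


def iterations_per_cpu_hour(iterations, cpus, wall_hours):
    """Throughput measure corresponding to the last column of Table 5."""
    return iterations / (cpus * wall_hours)


# (iterations, CPUs, job turnaround time in hours) as listed in Table 5.
RUNS = {"7 CPUs": (4000, 7, 12.25), "28 CPUs": (3250, 28, 2.67)}
rates = {name: iterations_per_cpu_hour(*run) for name, run in RUNS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} iterations per CPU hour")

# Wall-clock time for about 80 consecutive computations, ignoring queueing time.
for name, (_, _, hours) in RUNS.items():
    print(f"{name}: {80 * hours / 24.0:.1f} days for 80 runs")

# Hypothetical block sizes, only to show how the ratio is evaluated.
print(balancing_ratio([120, 100, 100, 80], n_cpus=4))
```

Pure run time for 80 jobs drops from roughly 41 days to roughly 9 days, in line with the quoted reduction from about a month to about a week.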

To identify good bump designs, automated shape optimizations are performed, 
where around 80 consecutive computations are necessary. The overall turnaround 
time for such an optimization can be reduced from around 1 month to around 1 week 
with the finer splitting of the mesh. 


5 Conclusions 

A joint CFD and wind tunnel study was conducted to develop a method of 
performing detailed studies of the flow physics of shock control bumps. In order to 
make the comparison possible, the free stream conditions in CFD were adapted so 
that the boundary layer properties upstream of the shock/boundary layer interaction 
on the clean airfoil matched those of the wind tunnel inflow. In addition, an 
adverse pressure gradient was imposed in the wind tunnel using a 
post-shock diffuser, modelling the effect of airfoil curvature. Good agreement between 
the methods was achieved with respect to both considerations in the region of 
interest. 

A contour SCB was tested both experimentally and computationally using the 
new set-up for two different shock positions relative to the bump crest. CFD 
and wind tunnel results showed good agreement with each other, including the 
separation behaviour for the ‘off-design’ shock position. This validation of the CFD 
results, especially the separation prediction, gives confidence in the computational 
results for varying flow conditions, allowing the effects of parametric variations 
on the flow to be studied quickly and relatively inexpensively. Additionally, 
it was shown that typical flow conditions on a transonic airfoil can be replicated in 
the wind tunnel, allowing relevant further research into detailed flow physics. 

Finally, computational results were used to examine the behaviour of the flow and 
confirm some suggestions made by previous publications from the current work. In 
particular, the vortical wake structure produced by the bump was examined. The 
vorticity was shown to be generated by the bump tail and, although weak compared 
to other forms of boundary layer control, is seen to bring about a measurable total
pressure recovery in the lower boundary layer further downstream. This suggests 
that the SCB could be used as an effective form of boundary layer control for wings, 
although further research is required into this concept at present. 


Water Droplet Flow Paths and Droplet
Deposition in Low Pressure Steam Turbines 


J. Starzmann, M.V. Casey, and J.F. Mayer


Abstract The complex three-dimensional two-phase flow in a low pressure steam 
turbine is investigated with comprehensive numerical flow simulations. In addition 
to the condensation process, which already takes place in the last stages of steam 
turbines, the numerical flow model is enhanced to consider the drag forces between 
the droplets and the vapour phase. The present paper shows the differences in the 
flow path of the phases and investigates the effect of an increasing droplet diameter. 
For the flow simulations a performance cluster is used because of the high effort for 
such multi-momentum two-phase flow calculations. In steam turbines the deposition 
of small water droplets on the stator blades or on parts of the casing is responsible 
for the formation of large coarse water droplets and these may cause additional 
dissipation as well as damage due to blade erosion. A method is presented that 
uses detailed CFD data to predict droplet deposition on turbine stator blades. This 
simulation method to detect regions of droplet deposition can help to improve the 
design of water removal devices. 


1 Introduction 

The development of steam turbines reaches back to the nineteenth century, when
Parsons and, at the same time, de Laval worked on the first steam turbine concepts
for industrial use. Today steam turbines are very important in energy conversion,
because a steam turbine is used for 70% of worldwide electricity generation,
see e.g. [17]. Especially in newly industrialised countries the need for steam power
plants will rise further in order to meet the high energy demand in these countries.
Classically, fossil fuels are used to generate steam, which is then expanded in the
steam turbine. Besides fossil fuels, other heat sources such as biomass,
geothermal or solar-thermal energy can also be used to vaporise the working fluid,
so that steam turbines will retain their important role in energy conversion, even in
an environment with renewable energy sources.

Fig. 1 Wetness formation in steam turbines and the corresponding expansion line

The present paper investigates a special but well-known issue in steam turbines. 
Steam expansion in almost all low pressure steam turbines reaches saturated steam 
conditions even before the condenser is reached. At the outlet of the turbine already 
8 to 16% wetness exists which leads to different kinds of problems. In addition 
to enhanced corrosion and energy dissipation a serious danger for operational 
safety is given by droplet erosion of the blades. The wet steam flow is investigated 
numerically with the aim to improve modelling methods for this flow and to enhance 
the understanding of the complex three-dimensional two-phase flow phenomena that 
take place in low pressure steam turbines. 

In Fig. 1 the process of wetness formation in a steam turbine is illustrated on the
left, and on the right the associated expansion process is shown in an enthalpy-entropy
diagram. The most important part of the wetness losses is called thermodynamic
wetness loss because it is caused by an irreversible heat transfer during the
non-equilibrium condensation itself. The considerable entropy increase after droplet
formation is visible in the h-s-diagram. The wetness losses that are generated
downstream of this zone of first condensation are lower but also noticeable.

Condensation in low pressure steam turbines mainly takes the form of droplet
condensation. Thus a two-phase flow exists which consists of saturated steam and
dispersed small water droplets. Between the droplets and the vapour a second source
of loss occurs, known as drag loss, because large droplets are not able to follow the
steam path exactly and thus exert a drag force on the steam.

Fig. 2 Velocity triangles of steam and coarse water (left), example for blade erosion [14] (right)


In principle, the third important loss is caused by droplets which deposit on
the stator blades. These droplets form thin water films or rivulets on the blade
surface which are transported to the trailing edge of the blade by the
shear forces acting on the water surface. At the trailing edge this water film breaks up and
coarse water droplets with diameters between 10 and 500 µm are generated, as also
outlined in Fig. 1. The distance between the stator trailing edge and the rotor leading
edge is too short for these big droplets to be accelerated from their initially very
low velocities (similar to the water film velocity) to the considerably higher steam
velocities. The velocity triangles of the steam flow and the droplets are compared
in Fig. 2. The droplets strike the rotor blades at high negative incidence.
In addition to further drag losses and the momentum exchange
with the rotor (known as braking losses), the coarse water droplets lead to erosion.
An example of erosion damage in a low pressure steam turbine is given in Fig. 2.
The picture indicates that droplet erosion mainly takes place at higher blade radii,
where the impact velocities on the rotor blades increase to about 300 to 600 m/s.


2 Objectives of the Project 

Within the research initiative “Kraftwerke des 21. Jahrhunderts” a project has been 
initialised to investigate the condensation effects in the flow field of low pressure 
steam turbines. An important aim of this project is to enhance knowledge about the 
complicated physics in rapidly expanding and condensing flows and to improve the 
prospects of numerical flow modelling. 

Results concerning the influence of the rotor-stator induced unsteadiness on
the nucleation process have already been published at the last workshop, see [16].
That study, like previous investigations by other authors, assumes that the
droplets follow the steam flow without any slip. The present paper shows recent
results concerning the influence of the friction between the droplets and the
vapour phase on the flow field of a low pressure steam turbine. It
also focuses on the prediction of droplet deposition, which produces the large coarse water
droplets that are responsible for braking losses and erosion damage.


3 Developments in Numerical Wet Steam Flow Modelling

In recent decades specific wet steam models to calculate the droplet formation
(nucleation) and droplet growth were implemented in one-, two- and
three-dimensional CFD codes. The two-phase flow models differ in how the wetness
equations are treated. The early models follow an Euler-Lagrangian approach,
where the conservation equations of the vapour phase are treated in an Eulerian
frame of reference and the additional equations related to the wetness formation
are integrated along the streamlines in a Lagrangian way. A non-viscous
two-dimensional model was published by Bakhtar [1] and a three-dimensional method
that also considers the viscous effects within the vapour phase was proposed by
Gerber [8]. Modern three-dimensional investigations of the two-phase flow in steam
turbines are realised with RANS solvers using an Euler-Euler multi-phase approach.
The works of Heiler [13], who also put considerable effort into the validation of the specific
wet steam models, and Wroblewski [19] can be cited here. A more detailed
overview of theoretical treatments for wet steam flows is given by Bakhtar [2].


4 Multi-momentum Flow Model 

Condensing flow in the investigated model low pressure steam turbine is calculated 
with a special two-phase model implemented in Ansys CFX. This model was 
developed by Gerber and a first three-dimensional simulation of a multi-stage 
turbine using this model was realised together with the Institute of Thermal 
Turbomachinery (ITSM), see [10]. For further details about the numerical model 
the interested reader should refer to the publications of Gerber et al. [8-10] and to 
[16] where the models and their implementation are described. 

Recently the multi-phase RANS solver was enhanced to consider the drag
between the droplets and the vapour phase, see [9]. In addition to the momentum
equation for the vapour phase, this model solves a separate momentum equation for
the droplet phase (multi-momentum model). The momentum equations for the
continuous (c) vapour phase and the dispersed (d) droplet phase are given as follows,
where $\alpha$ is the liquid volume fraction, $u$ the velocity, $p$ the pressure and $\tau_{ij}$ the
viscous stress tensor.


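$$\frac{\partial (\alpha \rho u_i)_c}{\partial t} + \frac{\partial (\alpha \rho u_j u_i)_c}{\partial x_j} = \ldots \qquad (1)$$

$$\frac{\partial (\alpha \rho u_i)_d}{\partial t} + \frac{\partial (\alpha \rho u_j u_i)_d}{\partial x_j} = \ldots \qquad (2)$$

(Only the left-hand sides of Eqs. (1) and (2) are recoverable; the right-hand sides, each containing a sum starting at d = 1, are missing.)
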
On the right-hand side of equations (1) and (2) the momentum exchange between
the phases due to the condensing mass flow $m_{cn}$ and the drag force acting between
the phases is calculated. The source term $S_{V,d}$ is given by the drag force on a single
droplet times the number of droplets $N$.



(3) 


$A_p$ is the projected area of a droplet with the radius $r$ and is given by $A_p = \pi r^2$.
The drag force that is acting between the phases depends on the drag coefficient,
which is modelled by the well-known drag law of Schiller-Naumann [15] for
spherical particles.


$$C_D = \frac{24}{Re_p}\left(1 + 0.15\,Re_p^{0.687}\right) \qquad (4)$$


For large particle Reynolds numbers (over 1,000) the drag coefficient is limited to a
value of 0.44 because in this regime the drag becomes independent of the Reynolds
number, see e.g. [4].
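
The drag law above can be sketched in a few lines of Python. This is a minimal illustration, not the Ansys CFX implementation. The function names, the use of scalar (one-dimensional) velocities, the definition of the particle Reynolds number with the droplet diameter, and the standard single-sphere drag force 0.5 ρ_c C_D A_p |Δu| Δu are all assumptions made here for illustration.

```python
import math


def schiller_naumann_cd(re_p: float) -> float:
    """Drag coefficient of a sphere after Schiller-Naumann, limited to 0.44."""
    if re_p <= 0.0:
        raise ValueError("particle Reynolds number must be positive")
    if re_p > 1000.0:
        # Regime in which the drag no longer depends on the Reynolds number
        return 0.44
    return 24.0 / re_p * (1.0 + 0.15 * re_p ** 0.687)


def drag_source(n_droplets, rho_c, mu_c, radius, u_c, u_d):
    """Drag on one droplet times the droplet number N (assumed standard form).

    n_droplets: number of droplets N
    rho_c, mu_c: density and dynamic viscosity of the vapour phase
    radius: droplet radius r
    u_c, u_d: scalar velocities of vapour and droplets (1-D simplification)
    """
    du = u_c - u_d
    if du == 0.0:
        return 0.0  # no slip between the phases, hence no drag
    re_p = rho_c * abs(du) * 2.0 * radius / mu_c
    a_p = math.pi * radius ** 2  # projected area A_p = pi r^2
    return n_droplets * 0.5 * rho_c * schiller_naumann_cd(re_p) * a_p * abs(du) * du
```

The sign convention (positive when the vapour is faster than the droplets) is likewise a choice made for this sketch.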


5 Influence of Inter-phase Drag on the Flow Field 

The ITSM steam turbine is a scale model of a modern three-stage turbine design. It 
is scaled by a factor of approximately 4. The focus of the present work is on the last 
stage (S3/R3) of the turbine because wet steam conditions already exist at the inlet 
(plane E30) of this stage, as Fig. 3 shows. 

In the case of a model steam turbine the fog droplets that are generated by
homogeneous nucleation are very small. The circumferentially averaged mean droplet
size at the inlet of the last stage is shown over the blade height in Fig. 4, where it
can be seen that the typical droplet size is 0.1 to 0.3 µm. Such small droplets are
able to follow the steam path very well and, as they travel at steam velocity, no drag
loss is generated. This situation changes as the droplets become larger and are
less able to follow the steam flow. For this reason an additional investigation of the
last stage of the turbine was made to clarify the droplet size at which the
situation changes. Several calculations of the flow in the last stage were performed.
The droplet diameter at the inlet was increased systematically, whereas the liquid mass
flow was kept constant. The inlet conditions for the default inlet diameter (Dx1) and
the four times larger droplet diameter distribution (Dx4) are shown in Fig. 4 together
with the wetness fraction y.
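
One consequence of scaling the diameter at constant liquid mass flow can be illustrated with a short calculation. Assuming spherical droplets of uniform size and constant liquid density (assumptions not stated in the text), the liquid mass per droplet grows with the cube of the diameter, so the droplet number must fall by the same factor. The function name below is hypothetical.

```python
def droplet_number_ratio(diameter_factor: float) -> float:
    """Droplet number relative to the default case at equal liquid mass.

    Assumes spherical droplets of uniform size and constant liquid density,
    so the mass per droplet scales with the cube of the diameter.
    """
    if diameter_factor <= 0.0:
        raise ValueError("diameter factor must be positive")
    return 1.0 / diameter_factor ** 3


if __name__ == "__main__":
    for k in (1, 4, 8, 16):
        print(f"Dx{k}: relative droplet number = {droplet_number_ratio(k):.3e}")
```

Under these assumptions the coarser cases carry the same liquid mass in far fewer, heavier droplets.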

Results of simulations with the default inlet diameter distribution (Dx1) and with four
(Dx4), eight (Dx8) and sixteen (Dx16) times larger droplet diameter distributions
are shown in Fig. 5. The meridional streamlines of the vapour and the droplets for
the last stage of the turbine show that no noticeable difference in the flow path occurs
until the (Dx8)-calculation.

Fig. 3 Sketch of the investigated ITSM low pressure steam turbine

Fig. 4 Droplet diameter size and wetness fraction at the inlet of the last stage (S3/R3); axes: wetness fraction y, mean droplet diameter D [µm]

In the following section the influence of the droplet size on the deposition of
droplets on the stator blade S3 is studied. But a tendency can already be seen from
the streamlines on a surface at constant blade height, see Fig. 6. For the
calculation with the default inlet diameter (Dx1) even the strong turning of the flow
within the blade channel does not lead to a different flow path of the vapour and
the droplets. As the droplet sizes are increased, the calculation (Dx8), with 8 times
larger droplets, is the first that shows a considerable effect. The absolute averaged
droplet diameter at 50 % span height at the inlet of the stator S3 is then 0.7 µm. This
can be obtained from Fig. 4, if the (Dx1)-value at 50 % span is multiplied by the
factor of 8.

Fig. 5 Meridional streamlines for steam (black) and droplet trajectories (blue with symbols)


6 Droplet Deposition in Steam Turbines 

For steam turbines that operate completely under wet steam conditions information
about droplet deposition and coarse water formation is already needed during
the design process. The turbine designer has to decide if special water removal
devices at the casing or even hollow stator blades with water removal suction slots
are needed to avoid erosion problems, see Fig. 7. In the following section a short
overview of the mechanisms of droplet deposition is given. Subsequently the
method to predict three-dimensional droplet deposition is presented together with
the numerical results.


6.1 Theory of Droplet Deposition 

Particle deposition mechanisms in low pressure steam turbines can be divided into
the laminar phenomenon “inertial deposition” caused by streamline curvature (see
Fig. 2) and “turbulent diffusional deposition”. Droplet deposition by turbulent flow
effects was investigated in numerous earlier studies in which turbulent pipe flows
were considered. An overview of the literature and the theoretical aspects is
given by Guha [11]; more specific information related to steam turbines is
summarised by Crane [6].

Fig. 6 Vapour streamlines (black) and droplet trajectories (blue with symbols) at 50 % span; panel labels: Dx4, Dx8, Dx16

In the following some theoretical principles of droplet deposition are given, as
they are necessary to understand the results for a steam turbine stator blade presented
later. A first important quantity is the relaxation time, which is the time a droplet
needs to be accelerated to the steam velocity. According to Gyarmathy [12], for
small droplets with a diameter $D$ the kinematic relaxation time $\tau$ can be obtained
from Stokes' drag law with some adjustments based on the Knudsen number $Kn$.

$$\tau = \frac{\rho_d D^2}{18\,\nu_c\,\rho_c}\,(1 + 2.7\,Kn) \qquad (5)$$


Fig. 7 Example for water removal devices in low pressure steam turbines (Taken from [7])

A non-dimensional kinematic relaxation time can be defined with the friction
velocity $u_* = \sqrt{\tau_w / \rho_c}$, where $\tau_w$ is the wall shear stress.

$$\tau^+ = \frac{\tau\,u_*^2}{\nu_c} \qquad (6)$$

The useful deposition velocity $V$ is calculated as the droplet mass flux to the wall
$J_w$ per unit area divided by the bulk particle density $y$ (mass of droplets per volume).
Again the friction velocity $u_*$ can be used to obtain a non-dimensional deposition
velocity $V^+$.

$$V^+ = \frac{J_w}{y\,u_*} \qquad (7)$$
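
The quantities defined above can be evaluated with a short Python sketch. It assumes the Stokes relaxation time with the Knudsen correction of Eq. (5), the usual wall scaling τ+ = τ u*² / ν_c for the non-dimensional relaxation time, and V+ = J_w / (bulk density · u*). All function and parameter names are hypothetical, and the numbers in the usage example are purely illustrative and not taken from the simulations.

```python
import math


def relaxation_time(rho_d, diameter, nu_c, rho_c, knudsen):
    """Kinematic relaxation time: Stokes drag with a Knudsen-number correction."""
    return rho_d * diameter ** 2 / (18.0 * nu_c * rho_c) * (1.0 + 2.7 * knudsen)


def friction_velocity(tau_w, rho_c):
    """Friction velocity u* from wall shear stress and vapour density."""
    return math.sqrt(tau_w / rho_c)


def nondimensional_relaxation_time(tau, u_star, nu_c):
    """tau+ = tau * u*^2 / nu_c (standard wall scaling, assumed here)."""
    return tau * u_star ** 2 / nu_c


def nondimensional_deposition_velocity(j_w, bulk_density, u_star):
    """V+ = J_w / (bulk particle density * u*)."""
    return j_w / (bulk_density * u_star)


if __name__ == "__main__":
    # Illustrative inputs only (SI units), not data from the paper
    tau = relaxation_time(rho_d=1000.0, diameter=0.2e-6, nu_c=1.0e-4,
                          rho_c=0.1, knudsen=0.5)
    u_star = friction_velocity(tau_w=10.0, rho_c=0.1)
    print("tau  =", tau)
    print("tau+ =", nondimensional_relaxation_time(tau, u_star, nu_c=1.0e-4))
```

Because the relaxation time grows with the square of the diameter, doubling the droplet size quadruples τ+ at fixed flow conditions (for a fixed Knudsen number).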

From experiments on turbulent pipe flows the behaviour of the non-dimensional
deposition velocity with changing particle relaxation times can be obtained; an
example is shown in Fig. 8. Three different regions can be detected from the
experimental data. The first is called “turbulent particle diffusion regime” because
the transport mechanism of very small droplets across the viscous boundary to the
wall is due to Brownian and also eddy diffusion. With rising particle diameter inertia
becomes relevant and the interaction with the turbulent eddies in the near wall region
results in a significant increase in droplet deposition. The second regime is therefore
called “eddy-diffusion-impaction regime”. Early investigations by Caporaloni [3]
suggest that a mechanism called “turbophoresis” is responsible for the increase of
deposition in this second regime. Turbophoresis means that particles are transported 
into regions with lower turbulence levels which is comparable to the general 
diffusion mechanism, where a diffusive flux of particles occurs towards lower 
particle concentrations. A further growth in particle size, and thus in relaxation time, 
leads to a dominance of inertial forces on the particle deposition. Larger particles 
are less affected by small-scale turbulent eddies and thus the third region in Fig. 8 is 
called “inertia-moderated” regime. 

It has to be mentioned that the prediction of droplet deposition in steam turbines is
hindered by the fact that the droplet relaxation times can be expected to lie within the
second regime, and in this regime the droplet deposition is highly sensitive to small
changes in droplet size.

Fig. 8 Deposition velocity for different relaxation times according to Guha [11]; horizontal axis: log10(τ+), from small particles to large particles


6.2 Method to Predict Droplet Deposition 

Different methods to predict droplet or particle deposition for simple pipe flows
exist, see Guha [11]. The first serious investigations of droplet deposition on turbine
blades due to turbulent effects were carried out by Crane [5]. Yau and Young [20]
adjusted a model from Wood [18] for use on turbine flows. The model of Wood was
originally developed for the prediction of particle deposition in turbulent pipe flows.
Experimental data are used in this semi-empirical approach to fit a set of mathematical
equations so as to achieve better agreement with measured deposition rates.

For the present investigation the model of Yau and Young [20] is used to
determine droplet deposition on the stator blades of the last turbine stage that
is investigated here. The approach was programmed in an external routine and
embedded in the post-processing framework of Ansys CFX, which enables the
prediction of droplet deposition without any extra parallel computing time. However,
comprehensive multi-momentum CFD simulations are needed to provide the
database which is needed for the prediction of droplet deposition.

A key aspect of the present work compared to the original approach of Yau and
Young [20] from 1987 is that precise information about the near-wall flow
field is available from the simulation. In the late 1980s the friction velocity, for
example, had to be calculated by 2D boundary layer computation methods, which
were the best available methods at that time. The necessary input data for this method
were obtained from two-dimensional non-viscous blade-to-blade calculations. In
the present analysis flow data from three-dimensional viscous two-phase flow
simulations can be used, and as the simulations have an average y+ of 4, more
details of the boundary layers are resolved.

The present CFD simulations consider the friction between droplets and steam 
(see Sect. 4), thus the deposition due to inertial effects can be evaluated directly 
from the CFD results. In conclusion within the present approach a three-dimensional 
prediction of the droplet deposition due to inertial as well as turbulent diffusion is 
possible. 


6.3 Results of Droplet Deposition 

Droplet deposition on the stator S3 has been predicted for the flow situation given by
the calculations already presented in Sect. 5. These calculations differ only in the inlet
droplet diameter distributions, while the mass flows of the vapour and the liquid
phase are kept constant. The diagrams in Fig. 9 show the deposited mass flow related
to the overall liquid mass flow at the inlet of the stator S3. From the curves in the
left picture it can be seen that for small droplet sizes deposition due to turbulent
diffusion dominates. However, if the inlet droplet size is considerably increased
the inertial deposition becomes more and more important. Of course, the overall 
deposition is also increased for coarser droplet distributions. The right diagram 
of Fig. 9 gives the difference in droplet deposition between the suction and the 
pressure surface of the stator blade and shows that due to the inertial effects the 
major deposition occurs on the pressure surface. 

Further details about the three-dimensional characteristics of droplet deposition 
are presented in Fig. 10. The two contours present the predicted absolute deposition 
mass flow per unit area for the pressure side (PS) and the suction side (SS) of the 
blade. First of all it can be seen that droplet deposition mainly occurs around the 
leading edge and also on the pressure surface near the trailing edge (TE). Above 
all the deposition around the leading edge is due to inertial deposition because 
especially large droplets are not able to follow the high curvature of the vapour 
flow path in this region and strike directly onto the blade. 

In the diagrams on the right hand side of Fig. 10 a relative deposition $\delta_{rel}$
is defined over the blade height, which also helps to identify the relationship
between properties of the flow field and droplet deposition characteristics. Relative
deposition means that the absolute deposited mass flow, which is detected along a
constant blade height, is divided by the overall deposited mass flow. The white bars
represent the sum of inertial and turbulent deposition and the grey bars are related
to the inertial deposition only. Obviously the droplet deposition over the span height
is related to the droplet diameter distribution. In addition to the diameter, the
wetness fraction itself also has to be taken into account; however, in the present case the
wetness fraction is nearly constant, see Fig. 4.
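
The normalisation behind the relative deposition can be written out as a small Python sketch. It assumes that the deposited mass flow has already been summed per blade-height band; the function name and the input layout (one value per band, once for the total and once for the inertial part) are hypothetical.

```python
def relative_deposition(total_per_band, inertial_per_band):
    """Divide band-wise deposited mass flows by the overall deposited mass flow.

    total_per_band:    inertial + turbulent deposition per blade-height band
    inertial_per_band: inertial deposition only, same bands
    Returns two lists of relative values (white and grey bars, respectively).
    """
    if len(total_per_band) != len(inertial_per_band):
        raise ValueError("both inputs must cover the same blade-height bands")
    overall = sum(total_per_band)
    if overall <= 0.0:
        raise ValueError("overall deposited mass flow must be positive")
    rel_total = [m / overall for m in total_per_band]
    rel_inertial = [m / overall for m in inertial_per_band]
    return rel_total, rel_inertial
```

With this definition the relative values for the total deposition sum to one over the blade height, which makes profiles for different inlet droplet sizes directly comparable.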

Fig. 9 Relative droplet deposition on S3 with increasing inlet droplet size (horizontal axis of both diagrams: inlet diameter multiplication factor [-])


7 Computational Efforts 

The high performance cluster NEHALEM of the HLRS in Stuttgart has been 
used for the current computations with the flow solver Ansys CFX. Several issues 
contribute to the need to use a computing cluster for the simulations. Firstly the 
standard non-equilibrium condensation model requires an additional numerical 
effort compared to a single-phase model because extra equations have to be solved. 
In addition, for the present investigation the original two-phase model was further 
enhanced to consider the drag between the phases which leads to a still more 
extensive computational model. Finally, in order to reach a sufficient residual level
(RMS residuals < $10^{-5}$) the timestep has to be reduced compared to the standard
non-equilibrium model. In this context also the numerical grid resolution has to be
mentioned. A block structured hexahedral grid with an O-grid topology around the 
blades was used. For the calculation of droplet deposition an additional refinement 
in radial direction was necessary to reach a good quality of the results which leads 
to a grid with 3.3 million nodes for the last stage of the turbine. 

A single multi-momentum calculation of the last stage can be realised on the
ITSM PC-cluster with 24 CPUs (2.7 GHz) and takes about 3 days. An important
speed-up by a factor of 5.5 has been achieved by using the NEHALEM cluster,
and this is due to two effects. First, the number of CPUs can be increased from
24 to typically 56 CPUs; second, the optimised computer architecture leads to a
higher computing performance in this parallel simulation. Within the project about
30 simulations were made because different wet steam models and three different
turbine operation conditions were included in the investigation. On this basis the
pure numerical calculations would take between 2 and 3 months on a simple
PC-cluster, whereas a simulation time of only 2 weeks is needed on the high performance
computing cluster.
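
The stated durations can be checked with simple arithmetic. The sketch below assumes that the 30 simulations run one after another and, for the split of the speed-up, that the runtime scales ideally with the number of CPUs. Both are assumptions made here for illustration.

```python
simulations = 30          # about 30 simulations in the project
days_per_run_pc = 3.0     # one multi-momentum run on the PC-cluster
speedup = 5.5             # reported speed-up on the NEHALEM cluster

# Total sequential run time on each system
pc_days = simulations * days_per_run_pc
hpc_days = pc_days / speedup

# Split of the speed-up, assuming ideal scaling with the CPU count
core_factor = 56 / 24
architecture_factor = speedup / core_factor

print(f"PC-cluster:  {pc_days:.0f} days")
print(f"HPC cluster: {hpc_days:.1f} days")
print(f"CPU count factor: {core_factor:.2f}, remaining factor: {architecture_factor:.2f}")
```

The 90 days on the PC-cluster correspond to the upper end of the quoted 2 to 3 months, and roughly 16 days on the high performance cluster match the quoted 2 weeks.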

Fig. 10 Deposited mass flow rate $J_w$ on S3 (left) and relative droplet deposition $\delta_{rel}$ (right)


8 Conclusion

The present work investigates the flow field in the last stage of a three stage
model steam turbine by means of two-phase flow simulations with Ansys CFX. The
numerical method considers the condensation process that takes place in the flow 
path of the turbine. The main improvement compared to two-phase flow simulations 
in the past is that the drag between the vapour and the droplet phase is now also 
considered within the model. It has been shown that for normal operation conditions 
the condensed droplets are so small that almost no friction loss occurs. Furthermore 
the influence of increasing droplet diameter on the flow path of the liquid droplet 
phase compared to the general flow path of the vapour phase is discussed. 

In steam turbines a certain percentage of the liquid droplets deposit on the stator 
blades which is responsible for the formation of large coarse water droplets. These 
big droplets provoke additional losses (braking and drag losses) and in addition they 
can lead to a reduced operational safety due to blade erosion. The work presented 
here provides for the first time a method that is able to predict droplet deposition 
due to inertial and turbulent diffusion effects for a three-dimensional turbine blade. 
Such numerical results can be helpful during the time-scale of the design process 
because the engineers need to decide if water extraction methods are needed and 
where they should be positioned. 


MPC and Coarray Fortran: Alternatives to
Classic MPI Implementations on the Examples 
of Scalable Lattice Boltzmann Flow Solvers 


Markus Wittmann, Georg Hager, Gerhard Wellein, Thomas Zeiser, 
and Bettina Krammer 


1 Introduction 

In recent years, more and more parallel programming concepts have emerged as 
alternatives or improvements to the well established MPI concept. Arguments for all 
the new parallel languages or alternative communication frameworks are typically 
the increasing number of cores in modern systems and the hierarchical memory 
structure in clusters of multi-socket multi-core compute nodes. 

Hybrid parallelization using OpenMP within a (ccNUMA) shared memory node 
and MPI between the compute nodes is a long known concept. However, correct 
placement of processes and efficient synchronization remain open issues. Today, 
there are only limited examples where a hybrid OpenMP/MPI parallelization shows 
a significant performance boost compared to pure MPI. 

The MPC (Multiprocessor Computing) framework [9] is a unified parallel runtime
for the MPI and OpenMP parallel programming models. Its unified design may
provide better performance on hybrid MPI/OpenMP codes. The PGAS language
Coarray Fortran (CAF) has the appeal of being part of the next Fortran standard.

Therefore, we started to evaluate both concepts, MPC and Coarray Fortran, using 
very simple (communication) benchmarks but also lattice Boltzmann flow solvers. 


2 Related Work

Application performance optimization on shared-memory systems through the MPI 
library has been researched for a long time. A strategy is to execute MPI ranks inside 
threads instead of processes, which should make scheduling more lightweight and 
introduce potential for optimizations due to the shared address space. Besides MPC, 
this is investigated in TOMPI [3], TMPI [13], and AzequiaMPI [2, 8]. In contrast to
MPC, these projects target only MPI, whereas MPC unifies the runtime system of MPI,
OpenMP, and POSIX threads.

An evaluation of Coarray Fortran in the context of a lattice Boltzmann flow solver
has also been carried out by Hasert et al. [6]. They compare different parallelization
approaches with CAF and MPI. However, they come to the conclusion that on
current systems the MPI parallelization yields the best performance.


3 MPC: The Multiprocessor Computing Framework 

The design goals of the MPC framework are to improve the scalability and 
performance of applications running on clusters of (large) multi-processor/multi- 
core NUMA nodes and to reduce the memory footprint of the runtime system. MPC 
provides a unified runtime with its own implementations of the POSIX threads, 
OpenMP 2.5, and MPI 1.3 (with support for MPI_THREAD_MULTIPLE level) 
standards [1, 9, 10]. MPC uses a patched version of the GNU compiler to circumvent
the GOMP runtime and call the MPC OpenMP runtime instead. Thus, existing MPI 
or OpenMP codes only have to be recompiled with MPC. 

MPC uses its own threading library, and its scalable and NUMA-aware memory 
allocator. The main difference between MPC and other typical MPI libraries is 
the process virtualization technique: MPI ranks are executed inside threads instead 
of processes. The fact that MPI ranks on the same node share the same virtual 
memory space can provide performance benefits for example for intra-node MPI 
communications. 

MPI-style wrappers (mpc_cc, mpc_cxx, or mpc_f77, and mpc_run) are
provided to facilitate the use of MPC. Because of the process virtualization
technique, global variables have to be privatized so that existing MPI codes behave
as expected. Without this modification, global variables would be shared by all MPI
ranks on the same node, leading to incorrect behavior in most cases. A patched
version of GCC 4.4 can automatically do the required transformations using the
command line flag -fmpc-privatize. MPC can also be used with any compiler
for MPI-only code. In that case, the user may have to privatize global variables 
manually. However, for OpenMP or hybrid MPI+OpenMP codes, MPC can only be 
used with its patched GNU compiler. 
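
The need for privatization can be illustrated by analogy with ordinary threads. The Python sketch below is not MPC code and does not show what -fmpc-privatize does internally. It only demonstrates, under the assumption that each thread plays the role of one MPI rank, how a plain global is shared by all "ranks" while thread-local storage gives each of them its own copy. All names are hypothetical.

```python
import threading

shared_value = None                 # plain global: one copy for all threads
private = threading.local()         # thread-local storage: one copy per thread


def rank_body(rank, results, barrier):
    """Stand-in for the code executed by one MPI rank running as a thread."""
    global shared_value
    shared_value = rank             # every "rank" overwrites the same variable
    private.value = rank            # each "rank" writes only its own copy
    barrier.wait()                  # all writes are done before anyone reads
    results[rank] = (shared_value, private.value)


def run(n_ranks=4):
    results = [None] * n_ranks
    barrier = threading.Barrier(n_ranks)
    threads = [threading.Thread(target=rank_body, args=(r, results, barrier))
               for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for rank, (seen_shared, seen_private) in enumerate(run()):
        print(f"rank {rank}: shared global = {seen_shared}, private copy = {seen_private}")
```

After the barrier all threads read the same value from the shared global (whichever rank wrote last), whereas each private copy still holds the rank that wrote it. This is the behaviour an unmodified MPI code expects from its global variables.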

The startup mechanism of MPC currently only supports a limited number of 
batch systems (originally only Slurm; in the meantime also Torque). 

Fig. 1 PingPong bandwidth measured with Intel MPI Benchmarks (with -off_cache option
set). “LD” means “locality domain”, i.e. cores sharing a memory interface. In the legend, “Intel”
denotes Intel compiler with Intel MPI and “MPC” denotes MPC with the patched GCC 4.4.
(a) LRZ’s MPP cluster (b) RRZE’s Westmere cluster


For the benchmarks the Intel compiler 12.1 in combination with Intel MPI 4.0.3 
(labeled as “Intel”) and MPC 2.3.1 with the patched GCC 4.4 (labeled as “MPC”) 
were used. 

Figure 1 shows results obtained on LRZ’s MPP cluster (2 socket nodes with octo- 
core AMD Opteron 6128HE processors and QDR Infiniband between the nodes) 
and on RRZE’s Westmere cluster (2 socket nodes with hexa-core Intel Xeon X5650 
processors and QDR Infiniband between the nodes). As expected, the intra-node 
bandwidth as measured by the PingPong benchmark of the Intel MPI benchmarks 
(IMB, version 2.3.2) [7] is higher for MPC than with typical MPI implementations. 
The inter-node results of MPC are less convincing. For the PingPong benchmark the
-off_cache option was used to avoid cache effects [5].

A severe limitation of common MPI implementations is that non-blocking
communications (e.g. MPI_Isend and MPI_Irecv) are unfortunately not asynchronous,
i.e. there is only progress while MPI library code is executing. On Cray
XT systems with SeaStar interconnect asynchronous data transfer did occur [11].
The newer Cray XE6 systems with Gemini interconnect no longer allow
overlapping the data transfer for large messages. The overlapping capabilities were
investigated with a simple overlap benchmark as described in [5]. Using Isend-Recv,
MPC does asynchronous data transfer; however, there is no overlap of
communication with computation in case of Send-Irecv or Isend-Irecv, as
shown in Fig. 2. Overlapping capabilities may improve in the next version of MPC
(2.4.0) [4].

Fig. 2 Ability of MPI implementations for asynchronous communication with non-blocking 
point-to-point calls: A fixed amount of data (70 MB) is always transferred while the amount of 
work is increased. Ideally, small amounts of work should be completely hidden by the constant 
communication time. The Intel MPI curves for all three communication types collapse. In the 
legend “Intel” denotes Intel compiler with Intel MPI and “MPC” denotes MPC with the patched 
GCC 4.4. (a) LRZ’s MPP cluster, (b) RRZE’s Westmere cluster 

Applying hybrid MPI/OpenMP to our large-scale lattice Boltzmann flow solver 
did not give any benefit. Best performance is obtained with pure MPI, as shown in 
Fig. 3. In pure MPI mode, Intel MPI with Intel Compiler (“Intel”) performs slightly 
better than MPC MPI with patched GCC (“MPC”), but this may be due to the 
compiler, as Intel MPI with GCC (“GCC”) shows almost the same performance 
as “MPC”. In MPI/OpenMP mode, Intel MPI coupled with Intel OpenMP is 
significantly better than MPC with the patched GCC compiler, and hybrid Intel 
MPI with GCC compiler is still better than hybrid MPC with GCC: the fact that 
the MPI and OpenMP runtimes are unified in MPC does not seem to offer a 
performance benefit in this case. Improving the OpenMP runtime and support for 
hybrid applications is an ongoing effort in MPC. 

Fig. 3 Performance of ILBDC flow solver simulating a randomly filled packed bed catalyst reactor 
on RRZE's Westmere cluster using MPI and hybrid MPI/OpenMP. The simulation domain consists 
of roughly 150 million fluid nodes. In the legend “Intel” denotes Intel compiler with Intel MPI, 
“MPC” denotes MPC with the patched GCC 4.4, and “GCC” denotes a standard GCC 4.4 with 
Intel MPI 


4 Coarray Fortran 

Several Coarray Fortran implementations exist, e.g. Cray, Intel (since the 12.x 
compiler series), Rice, Open64, and g95. 

We have not been able to get the Rice or Open64 compiler running in any 
reasonable way. Development of g95 seems to have stopped. The Intel compiler 
can be used for program development due to the high quality of the generated 
code; however, (multi-node) performance even on Infiniband clusters is not yet 
ready for productive use in the current version (12.1 update 4). The only platform 
with reasonable compilers and hardware support is the Cray XE6 with Gemini 
interconnect. 

An often heard argument for Coarray Fortran is the ease of converting a non-parallel 
program into a distributed-parallel program. However, this simplicity is as 
ambivalent as with OpenMP. In our tests, good performance could only be obtained 
if an MPI-style parallelization concept was followed. 

Details can be found in the Master's thesis of Klaus Sembritzki [12] and will be 
published elsewhere. 


5 Conclusion and Future Work 

There is still no real alternative to MPI on the horizon. Nevertheless, in the near 
future we plan to investigate GPI (Global Address Space Programming Interface) 
from FhG/ITWM as another PGAS approach, since GPI is to be extended by ITWM 
within the BMBF project GASPI to include fault tolerance features, which are 
important for the FETOL BMBF project we are participating in. 


Part V 

Transport and Climate 

Prof. Dr. Christoph Kottmeier 


The HPC requirements for simulations of natural systems are rapidly increasing. 
This is reflected by, e.g., the large storage amount and CPU-times of regional climate 
simulations in the projects HRCM: “Modelling Near Future Regional Climate 
Change for Germany and Africa” (KIT Karlsruhe) and LUCCI: “Regional 
Climate Simulations for Southeast Asia” (KIT Garmisch-Partenkirchen). These 
studies provide assessments of the capabilities of regional climate models in 
simulating the observed climate of the last decades and the future in various regions 
on Earth. 

Regional climate change projections of this kind are used more and more 
to enable estimates of climate change consequences in various economic and 
social sectors. HRCM addresses the needs of hydrology and flood management 
in medium-sized river catchments, of basic land-atmosphere exchange studies, and of 
climate predictability in Europe and Africa. LUCCI aims at providing information to 
assess changes in future land-use and agricultural productivity in Central Vietnam. 

Other projects like WRFCLIM (University of Hohenheim) or MIPAS (KIT Karlsruhe) 
also reflect very well the high importance of the HLRS and SSC computing 
facilities for highly visible programmes in current research in meteorology 
and oceanography. They were not chosen for presentation, since they have been described 
in earlier HLRS reports. 

The project AGULHAS (“The Agulhas System as a Key Region of the Global 
Oceanic Circulation”, IFM-Geomar Kiel) applies a model hierarchy which is based 
on the proven ocean model AG01 and the tested high-resolution model INALT01. 
For a full representation of the AGULHAS dynamics, a high spatial resolution with 
grid scales of less than 10 km is achieved. The new focus of recent runs was on the 
climate change signal in the Agulhas leakage of Indian Ocean water to the Atlantic. 


Prof. Dr. Christoph Kottmeier 

Institut für Meteorologie und Klimaforschung, Karlsruher Institut für Technologie, 
Wolfgang-Gaede-Str. 1, 76131 Karlsruhe, Germany 





The simulations show an intensification of this process causing a salinification of 
upper waters in the South Atlantic. 

At much higher model resolution and for much smaller domains, the project 
TIGRA (“Numerical Investigation of stratified turbulence”) addresses Direct 
Numerical Simulation (DNS) and Large Eddy Simulation (LES) of stably stratified 
turbulent fluids. Simulation of geophysical turbulent flows as mentioned above 
requires robust and accurate subgrid-scale turbulence modelling. 

It could be shown that the implicit turbulence model ALDM correctly predicts 
the turbulence energy budget and the energy spectra of stratified turbulence. This is 
surprising, since dissipative structures are not resolved on the computational grid. 


Abstract The modelling of future regional climate change for Germany and Africa 
using the regional climate model COSMO-CLM (CCLM) comprises basic studies 
on how temperature and precipitation are affected in general, but also specific impact 
studies whose results can be used for the planning of adaptation and mitigation 
measures. 

For Africa, ERA-Interim-driven simulations have been carried out within the 
CORDEX framework. Evaluation studies using two different horizontal grid spacings 
(0.44° and 0.22°) did not show significant differences in the results. Furthermore, 
these simulations have been used to perform the transition from NEC 
computing systems to the new CRAY XE6, and to investigate the impact of this 
change on the model results, which is very small. 

The impact of likely heavier summer rainfall on soil erosion in southern Germany 
is investigated within the KLIWA project “Bodenabtrag durch Wassererosion in 
Folge von Klimaveränderungen”. The corresponding simulations are performed 
with very high horizontal resolution (7, 2.8, and 1 km). Results show an added 
value of the convection-permitting scale (2.8 km) in comparison with coarser spatial 
resolution (7 km). The high-resolution simulation also represents well the frequency 
of high and very high precipitation days. 

In order to answer the important question of how to account for uncertainties in 
regional climate projections, it is necessary to understand the sensitivity of the 
model results to various processes, e.g. the initialisation of soil moisture, and to 
create multi-member ensembles of climate simulations. The latter aspect is tackled 
within the Helmholtz Climate Initiative REKLIM (Regional Climate Change). In 
order to capture uncertainties related to the positioning of synoptic systems, a 
multi-member ensemble of climate simulations is generated by introducing small 
shifts to the large-scale atmospheric forcing provided by low-resolution global 
climate models (GCMs). The shifted atmospheric fields are then used to drive 
CCLM simulations at 50 km resolution. These shifts have a considerable effect on 
the CCLM results, in particular during hydrological summer. Thus, the ensemble 
generation using the shifting method and, moreover, the usage of different GCMs 
are valuable for an improved representation of present climate conditions and 
for projecting regional climate change. 

Soil moisture is a crucial component in the 
atmospheric water cycle, due to the long-term memory effect of the deep soil and 
its feedbacks with the atmosphere. Although observational data on the temporal 
and heterogeneous 3-dimensional spatial distribution of soil moisture are sparse, 
such information is necessary to initialise climate models, especially for climate 
forecasts. Sensitivity studies varying the initial soil moisture distribution have been 
performed with COSMO-CLM. The model was able to reproduce the observations 
with respect to the temporal evolution of a drought index for the major European 
regions. Even after several years, effects of the soil initialisation could be found. 
The effect was most pronounced in areas with continental characteristics or at high 
latitudes (Scandinavia, Eastern Europe or the Mediterranean) and less so towards the 
Atlantic. 


H.-J. Panitz (✉) · G. Fosser · R. Sasse · A. Sehlinger · H. Feldmann · G. Schädler 
Institut für Meteorologie und Klimaforschung, Karlsruher Institut für Technologie (KIT), 
Eggenstein-Leopoldshafen, Germany 
e-mail: hans-juergen.panitz@kit.edu 

W.E. Nagel et al. (eds.), High Performance Computing in Science and Engineering ’12, 
DOI 10.1007/978-3-642-33374-3_28, © Springer-Verlag Berlin Heidelberg 2013 


1 Introduction 

The investigation of regional climate change comprises basic studies on how 
temperature and precipitation are affected in general, but also specific impact 
studies whose results can be used for the planning of adaptation and mitigation 
measures by the authorities. The regional climate simulation efforts carried out 
at the Institute for Meteorology and Climate Research (IMK) of the Karlsruhe 
Institute of Technology (KIT) using the regional climate model (RCM) COSMO- 
CLM (CCLM) consider both aspects. 

The simulations for Africa [1] carried out within the CORDEX framework 
[2] (CORDEX: Coordinated Regional climate Downscaling Experiment), which 
are part of the basic studies, have been further elaborated. Furthermore, these 
simulations have been used to perform the transition from NEC computing systems 
to the new CRAY XE6, and to investigate the impact of this change on the model 
results. 

The KLIWA (Klimaveränderung und Wasserwirtschaft; www.kliwa.de) project 
“Bodenabtrag durch Wassererosion in Folge von Klimaveränderungen” aims to 
assess the impact of climate change on soil erosion in southern Germany. Within 
this project precipitation data relevant for soil erosion will be provided for the recent 
past 1971-2000 and near future 2021-2050 using CCLM. Various project-partners 
will use these data as input for an erosion model. The work will focus on modelling 
extreme precipitation events at higher spatial and temporal resolution (2.8 km, 1 km; 
1 h, 15 min). The very high spatial and temporal resolutions (1 km and 15 min) are 
required by erosion models. 

To reach the goals of all projects and to assess the uncertainties of the climate 
projections, it is necessary to understand the sensitivity of the model results to 
various processes, e.g. the initialisation of soil moisture, and to create multi-member 
ensembles of climate simulations. The latter aspect is tackled within the Helmholtz 
Climate Initiative REKLIM (Regional Climate Change), where uncertainties related 
to the RCM and the large-scale atmospheric forcing from low-resolution global 
climate models (GCMs), which drive the RCM at its lateral boundaries, are assessed 
to draw reliable conclusions about the predictability of regional climate change. 


2 The CCLM Model 

The regional climate model CCLM is the climate version of the operational weather 
forecast model COSMO (Consortium for Small-scale Modeling) of the German 
Weather Service (DWD). It is a three-dimensional non-hydrostatic model which 
means that spatial resolutions below 10 km (which is considered the limit for 
hydrostatic models) are possible. The model solves prognostic equations for wind, 
pressure, air temperature, different phases of atmospheric water, soil temperature 
and soil water content. 

Further details on COSMO and its application as an RCM can be found in [3] and 
[4], on the web page of the COSMO consortium (http://www.cosmo-model.org), 
and in [5, 6] and [7]. 

The model is coded in Fortran 90, making extensive use of the modular structures 
provided in this language. Code parallelisation is done via MPI (message passing 
interface) on distributed memory machines using horizontal domain decomposition 
with a two-grid halo. 
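
The horizontal domain decomposition described above can be illustrated without MPI. The following is a minimal sketch, not taken from the CCLM source: it splits a global horizontal grid among ranks and extends each subdomain by a halo. It assumes that "two-grid halo" means a halo two grid points wide; the function names, the 118 × 110 example grid and the 4 × 2 process layout are illustrative assumptions.

```python
HALO = 2  # halo width in grid points (assumed reading of "two-grid halo")


def split_1d(n_points, n_parts):
    """Split range(n_points) into n_parts contiguous, nearly equal chunks."""
    base, extra = divmod(n_points, n_parts)
    bounds, start = [], 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def decompose(nx, ny, px, py, halo=HALO):
    """Return, per rank, the owned index ranges and the halo-extended ranges.

    Halo ranges are clipped at the edges of the global domain, where a real
    model would use lateral boundary data instead of neighbour data.
    """
    tiles = []
    for j, (y0, y1) in enumerate(split_1d(ny, py)):
        for i, (x0, x1) in enumerate(split_1d(nx, px)):
            tiles.append({
                "rank": j * px + i,
                "owned": ((x0, x1), (y0, y1)),
                "with_halo": ((max(x0 - halo, 0), min(x1 + halo, nx)),
                              (max(y0 - halo, 0), min(y1 + halo, ny))),
            })
    return tiles


# Hypothetical layout: 118 x 110 horizontal grid on 4 x 2 processes.
tiles = decompose(nx=118, ny=110, px=4, py=2)
for tile in tiles:
    print(tile["rank"], tile["owned"], tile["with_halo"])
```

In an MPI code, the points in `with_halo` that lie outside `owned` are the ones that have to be received from neighbouring ranks after every time step.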


3 Regional Climate Simulations Using the HLRS Facilities 
3.1 The Soil Erosion Project 

For the near future, recent research predicts an increase in precipitation in winter 
coupled with a slight decrease and more variability in summer in Central Europe. 
Despite high regional variability, studies indicate that there will be a rise in the 
frequency and intensity of summer extreme precipitation events. Such increased 
heavy summer rainfall is likely to lead to enhanced erosion risk. 

In this context, the KLIWA project “Bodenabtrag durch Wassererosion in Folge 
von Klimaveranderungen” was initiated to assess the impact of climate change on 
soil erosion in Southern Germany. Since soil erosion is a process acting on very 
small spatial and temporal scales, precipitation data and statistics at very high 
resolution are necessary. Therefore, it is intended to downscale climate projections 
to regional scales using CCLM in order to reach a spatial and temporal resolution of 
1 km and 15 min, respectively, for selected locations with high erosion risk (namely 
Weiherbach in Baden-Württemberg, which is the focus of this report, 
Mertesdorf in Rheinland-Pfalz, and Scheyern in Bayern). In addition, it is evaluated 
how the convection-permitting scale affects the precipitation statistics in comparison 
with coarser resolution, the changes in the climate change signal, and the erosion 
pattern due to extreme precipitation. 

For the high-resolution simulations a multiple nesting strategy is used: in a 
first RCM step, the results of a global model are downscaled to a 
resolution of 50 km; these results are then used to force the simulations with 
7 km resolution, and so on for the runs with 2.8 and 1 km resolution. For all of 
these nesting steps the same CCLM version is used. Long-term simulations at 50 
and 7 km resolution were already available from the CEDIM project “Flood 
Hazard in a Changing Climate” [1, 8]. Two sets of downscaling experiments have 
been carried out from the global scale to 1 km resolution. The first uses ERA-40 
reanalysis data [9] to drive the 50 km simulation within the first nesting step. Its 
purpose is the validation of the results of the whole downscaling procedure (from 
50 to 1 km resolution) for the recent past, 1971-2000. In the second set the results 
of the GCM ECHAM5 [10] are used as forcing data for the past (1971-2000) and 
the future (2021-2050). In both sets of experiments the 1 km resolution simulations 
will be applied only to certain episodes because of constraints in simulation time 
and memory storage capacity. 

It is not known a priori whether the model is sensitive to the domain size and 
location or to its internal setup. Therefore, sensitivity studies are needed in 
order to test the response of the model to these parameters. 

To establish the best domain size and location, four domains have been defined 
(Fig. 1). For each of them a 1-year simulation (the year 1986 was chosen) has been 
carried out, using a horizontal resolution of 2.8 km. The number of grid points within 
the domains varies between 60 × 60 and 140 × 140. Domain S5 represents the innermost 
simulation area, which encompasses the major area of interest, Weiherbach, for 
which the results of all sensitivity studies are evaluated. The larger domain S3 
includes the entire upper and middle Rhine valley, the Black Forest, the Swabian 
Jura, Eifel and Hunsrück. Due to this selection, regional atmospheric features 
induced by the local orography, e.g. the wind channelling in the Rhine Valley, can be 
represented more accurately. The largest domain, S4, covers the southwestern part 
of Germany, Luxembourg and part of France. 

For the sensitivity studies, the HYRAS [11] dataset has been used to validate the 
model results. This daily dataset, created by the German Weather Service (DWD), 
provides, for the period 1951-2006, the most important parameters for hydrological 
applications on a regular 1 km grid. 

This comparison revealed that the RCM results, especially precipitation averages, 
did not depend significantly on the chosen model domain (not shown). Therefore, 
the smallest domain S5 has been chosen for further sensitivity studies on different 
model settings, and also for the long-term simulations. 

Fig. 1 Map of Central Europe showing the locations of Mertesdorf, Weiherbach, and Scheyern 
catchments, the investigation area for Weiherbach (black box), and the three model domains (S3, 
S4, and S5) used in the sensitivity study 

To test the sensitivity of the model to internal settings, two setups have been 
tested, one consistent with the setup of the simulation using the coarser resolution 
(7 km), and a second one, which is consistent with the suggestions of DWD for 
high-resolution simulations (COSMO-DE) [3]. A further test took into account 
a more frequent update of the forcing data (3-hourly versus 6-hourly for the two 
reference setups). 

The studies have been performed for a 5-year period (1980-1984). They showed 
that the first setup together with the 3-hourly update of forcing data provided the 
most realistic results. Thus, using the first setup allows a more coherent comparison 
of simulation results using two different horizontal resolutions (7 vs. 2.8 km). It will 
also be used for the long-term studies (using 2.8 km resolution) and the episodic 
studies (using 1 km resolution) of the Weiherbach catchment. 

Although the research is not yet finished, some insight can be gained about the 
added-value of the convection-permitting scale (2.8 km) in comparison with coarser 
spatial resolution (7 km). 

Simulations at 7 km resolution are known to have an excess of drizzle, resulting 
in an overestimation of both frequency and amount of low precipitation intensities 
(RR < 1 mm per day). On the other hand, runs at 2.8 km systematically underestimate 
the number of wet days (RR > 1 mm per day). This dryness of higher 
resolution simulations can be compensated by a more frequent update of the forcing 
data, because the characteristics of the coarse resolution then have a stronger impact 
on the nested simulation, leading to a substantial decrease of the dry days. The 
high-resolution simulation also represents well the frequency of high and very 
high precipitation days (RR > 10 mm and RR > 20 mm per day) over the whole 
investigation domain. 

Fig. 2 Difference between CCLM results and HYRAS data for mean daily precipitation for 
hydrological summers 1980-1984. (a) 7 km horizontal resolution, (b) 2.8 km horizontal resolution 

For the evaluation area around Weiherbach, Fig. 2 displays the difference in 
mean daily precipitation amount between HYRAS data and model results. The daily 
precipitation amount is averaged over all days of the hydrological summer months 
(May to October) of the period 1980 until 1984. It is evident that higher resolution 
(Fig. 2b) reduces the bias substantially. 
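
The precipitation statistics used in this evaluation (frequency of wet, high and very high precipitation days, and the mean daily bias over the hydrological summer) can be sketched as follows. This is only an illustration under stated assumptions: the daily series are invented toy data, not CCLM or HYRAS values, and the function names are hypothetical.

```python
from datetime import date, timedelta


def exceedance_frequency(daily_rr, threshold):
    """Fraction of days with precipitation strictly above threshold (mm/day)."""
    return sum(1 for rr in daily_rr if rr > threshold) / len(daily_rr)


def summer_mean(dates, daily_rr, months=(5, 6, 7, 8, 9, 10)):
    """Mean daily precipitation over the hydrological summer (May to October)."""
    values = [rr for d, rr in zip(dates, daily_rr) if d.month in months]
    return sum(values) / len(values)


# Invented toy data: one year of daily values for "observation" and "model".
dates = [date(1980, 1, 1) + timedelta(days=k) for k in range(366)]
obs = [((k * 7) % 23) * 0.5 for k in range(366)]
model = [rr * 0.9 for rr in obs]  # a model that is uniformly 10 % too dry

for name, series in (("obs", obs), ("model", model)):
    print(name,
          "wet days:", round(exceedance_frequency(series, 1.0), 3),
          "RR > 10 mm:", round(exceedance_frequency(series, 10.0), 3),
          "RR > 20 mm:", round(exceedance_frequency(series, 20.0), 3))

bias = summer_mean(dates, model) - summer_mean(dates, obs)
print("hydrological summer bias (mm/day):", round(bias, 3))
```

For a gridded comparison such as the one in Fig. 2, the same bias would be computed separately for every grid point of the evaluation area.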


3.2 The Helmholtz Climate Initiative REKLIM 

Within the Helmholtz Network REKLIM (Regional Climate Change), high- 
resolution simulations from regional climate models are used to determine climate 
variability and change on the regional scale for attribution and impact studies 
(www.reklim.de). Uncertainties related to the RCM and the large-scale atmospheric 
forcing from low-resolution global climate models, which drives the RCM at its 
lateral boundaries, have to be assessed to draw reliable conclusions about the 
predictability of regional climate change. In order to capture these uncertainties, an 
ensemble of RCM simulations is generated by means of an innovative technique 
called Atmospheric Forcing Shifting (AFS). 

Fig. 3 Observed and modelled mean annual cycle of (a) 2 m temperature (°C) and (b) precipitation 
(mm/month) over Central Europe for the period 1980-1984 

The RCM simulations are conducted using COSMO-CLM with a horizontal 
resolution of 50 km for Europe. ERA-40 reanalyses [9] provide the initial and 
boundary conditions, which are updated every 6 h. The CCLM simulations span 
the period 1979-1984, including a 1-year spin-up time which is disregarded in the 
further investigations. Moreover, the CCLM output is analysed for Central Europe 
considering only land points. Ensemble generation via AFS is realised by shifting 
the atmospheric forcing from the ERA-40 reanalyses with respect to topography 
to the North (N), South (S), West (W) and East (E) by 25 km (1) and 50 km (2), 
respectively [12]. The shifted atmospheric fields are then used to drive the 50 km 
CCLM runs and, thus, eight shifting scenarios (N1, N2, S1 etc.) are performed, plus 
the reference simulation (Ref), which is forced by the non-shifted atmospheric fields. 
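
The naming scheme of the shifting scenarios can be made explicit with a few lines of code. The sketch below only enumerates the nine experiments and their shift vectors; it does not perform the shift of the forcing fields itself. The sign convention (east and north positive) and the function name are assumptions made for illustration.

```python
# Unit vectors (east, north) for the four shift directions (assumed convention).
DIRECTIONS = {"N": (0, 1), "S": (0, -1), "W": (-1, 0), "E": (1, 0)}
# Shift distance in km for the two levels named in the text.
DISTANCES_KM = {1: 25, 2: 50}


def afs_scenarios():
    """Map scenario names (Ref, N1, N2, ...) to shift vectors in km."""
    scenarios = {"Ref": (0, 0)}
    for name, (east, north) in DIRECTIONS.items():
        for level, distance in DISTANCES_KM.items():
            scenarios[f"{name}{level}"] = (east * distance, north * distance)
    return scenarios


scenarios = afs_scenarios()
for name, shift in scenarios.items():
    print(f"{name:>3}: shift east = {shift[0]:4d} km, north = {shift[1]:4d} km")
```

Note that a 25 km shift corresponds to half a grid cell at 50 km resolution, so an actual implementation would have to interpolate the forcing fields rather than move them by whole grid points.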

On the NEC SX-8 at the HLRS facilities, the computing requirements for the 
domain size of 118 × 110 × 40 grid points were about 48 node-hours per simulation 
year at 50 km resolution. Consequently, 108 node-days were required for the 6-year 
simulation period and nine model experiments. A node-hour is defined as the CPU 
time in hours that one node needs for the simulations, using all its available cores 
[1]. The necessary storage capacity per simulation year is on the order of 0.05 TB 
and thus amounts to 2.7 TB for the nine CCLM runs spanning the whole simulation 
period. 
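
The resource totals follow directly from the per-year figures. As a quick check, using only the numbers quoted above (the variable names are illustrative):

```python
NODE_HOURS_PER_YEAR = 48      # node-hours per simulation year at 50 km
YEARS = 6                     # 1979-1984, including the spin-up year
EXPERIMENTS = 9               # eight shifting scenarios plus Ref
STORAGE_PER_YEAR = 0.05       # storage per simulation year, as quoted

node_hours = NODE_HOURS_PER_YEAR * YEARS * EXPERIMENTS
node_days = node_hours / 24
storage_total = STORAGE_PER_YEAR * YEARS * EXPERIMENTS

print(f"{node_hours} node-hours = {node_days:.0f} node-days")
print(f"total storage = {storage_total:.1f}")
```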

In order to investigate the effect of the AFS on the CCLM simulations, the 
mean annual cycles of 2 m temperature and precipitation (Fig. 3) are determined from 
the model output and compared to E-OBS observations [13]. The mean absolute 
deviation between Ref and E-OBS is 1.0 °C for 2 m temperature and 15.9 mm/month 
for precipitation. On average, the spread within the COSMO-CLM ensemble is on 
the order of 0.4 °C for 2 m temperature and 5.4 mm/month for precipitation, which 
amounts to 40 % (2 m temperature) and 34 % (precipitation) of the 
deviation between Ref and E-OBS. For 2 m temperature, the minimum ensemble 
spread can be found in February (0.1 °C) whereas the maximum is in May (0.9 °C). 
For precipitation, the ensemble spread is lowest in December (3.1 mm/month) and 
largest in September (8.0 mm/month). 
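
The two measures compared here can be sketched in a few lines. The text does not define the ensemble spread precisely; the sketch assumes it is the difference between the largest and smallest ensemble member for each month, averaged over the annual cycle. The function names are hypothetical, and the final lines merely reproduce the quoted percentages from the quoted values.

```python
def mean_absolute_deviation(model, obs):
    """Mean absolute difference between two monthly series."""
    return sum(abs(m - o) for m, o in zip(model, obs)) / len(obs)


def mean_spread(members):
    """Average over months of (max - min) across ensemble members.

    `members` is a list of equally long monthly series, one per member.
    This definition of spread is an assumption, not taken from the text.
    """
    n_months = len(members[0])
    per_month = [max(s[k] for s in members) - min(s[k] for s in members)
                 for k in range(n_months)]
    return sum(per_month) / n_months


# Ratios of ensemble spread to the Ref vs. E-OBS deviation, from the quoted values.
ratio_t2m = 0.4 / 1.0      # 2 m temperature, degrees C
ratio_precip = 5.4 / 15.9  # precipitation, mm/month
print(f"{ratio_t2m:.0%} {ratio_precip:.0%}")
```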

In comparison to an ensemble of COSMO-CLM simulations driven by different 
GCMs [1] (Fig. 4), the spread of the AFS ensemble is clearly smaller and, on average, 
amounts to 15 % (2 m temperature) and 27 % (precipitation) of the GCM ensemble. 
In particular for precipitation, the variability of the AFS ensemble spread during 
the mean annual cycle is much larger than for 2 m temperature. The combination 
of both ensemble approaches, AFS and forcing from different GCMs, leads to 
an increase of the ensemble spread, especially in summer. The resulting improved 
reproduction of observations goes along with an advanced representation of present 
climate conditions. 

Fig. 4 Mean annual cycle of (a) 2 m temperature (°C) and (b) precipitation (mm/month) from 
E-OBS, AFS ensemble and GCM ensemble over Central Europe for the period 1980-1984 

Moreover, the AFS effect on the spatial distribution of 2 m temperature and 
precipitation is exemplified for the W2 and E2 scenarios (Fig. 5). In comparison 
to Ref, the W2 scenario leads to temperature increases of up to 0.6 °C (Fig. 5a). 
This coincides with a predominant precipitation decrease of up to 20.4 mm/month 
with large changes over the Alps (Fig. 5b). Precipitation enhancement (up to 
15.4 mm/month) occurs in regions where the temperature increases are lower 
than in the surrounding areas. In contrast, the E2 scenario leads to a temperature 
reduction of up to 0.3 °C in France, Germany, Poland and the Balkan States, 
whereas regions with rising temperature (up to 0.2 °C) are rather small (Fig. 5c). The 
predominant temperature decrease is associated with enhanced precipitation of up 
to 16.9 mm/month (Fig. 5d). Precipitation decreases of up to 23.9 mm/month occur 
again over the Alps, which might be an indicator of the sensitivity of precipitation 
to AFS over orographically structured terrain. As can be seen in Fig. 5, the 
temperature and precipitation changes resulting from AFS are related to each other. 
In particular, warming coincides with reduced precipitation whereas cooling is 
associated with a precipitation increase. Furthermore, the comparison to E-OBS (not 
shown) indicates that deviations between Ref and E-OBS can be partly compensated 
through AFS. 

The model experiments show that AFS is a valuable ensemble generation 
technique for capturing uncertainties related to RCMs, the large-scale atmospheric 
forcing, and extreme events. In particular, the synergy of combining different 
ensemble approaches such as AFS and forcing from different GCMs fosters the 
assessment of uncertainties in climate modelling and of the predictability of climate 
change on regional scales. For the generation of large high-resolution ensembles, 
the HLRS resources thus remain crucial. In the future, the CRAY XE6 at HLRS 
will be used for generating a four-member ensemble of COSMO-CLM simulations 
based on AFS at horizontal resolutions of 50 and 7 km for 30-year periods under 
present and future climate conditions. 

Fig. 5 Difference between the simulations W2-Ref (a and b) and E2-Ref (c and d) for 2 m 
temperature (°C) (a and c) and precipitation (mm/month) (b and d) for the period 1980-1984 
(only hydrological summer) 


3.3 Sensitivity to Soil Moisture Initialization 

Soil moisture is a crucial component in the atmospheric water cycle. It provides 
a longer-term memory compared to the atmosphere, due to its storage capacity 
and the longer response times of deeper layers. The feedbacks between soil and 
atmosphere include precipitation as the primary source for the soil water content 
and the evapotranspiration of water vapour from soil and vegetation towards the 
atmosphere [14]. 

Climate models need reliable information about the 3-dimensional distribution 
of the soil moisture at the initialisation stage. Errors in the soil moisture fields will 
cause artificial trends in the model simulations that overlay the climate trends. The 
observational base to derive soil moisture distributions for large areas at a given 
time is insufficient and the soil is very heterogeneous even on small scales. The few 
ground-based observations are therefore likely not representative of larger regions 
and usually do not cover climatological time scales or large areas. Therefore, remote 
sensing is likely to be the only method to provide large-scale information, although 
these measurement techniques only cover the uppermost layers of the soil [15]. 

Fig. 6 Time series of the effective drought index (EDI) for Mid-Europe (Data derived from CCLM 
(red line) and from E-OBS (black line)) 

Under such circumstances regional climate models use a very pragmatic 
approach. They generate their own balanced soil moisture distribution by a pre-analysis 
phase simulation starting from a simple vertical moisture profile. Usually, 
a spin-up time of about 1-3 years is used as a compromise between computing time 
requirements and the need for a balanced soil moisture distribution. 

To test the uncertainties and sensitivities induced by this initialisation problem, 
several simulation experiments with CCLM have been performed. The model 
domain covers Europe with a resolution of 50 km. The 30-year base simulation 
covers the period 1968 until 2000 and is driven by the ERA-40 reanalysis [9] to 
allow for a direct comparison with observations. 

To identify drought and moist periods the Effective Drought Index (EDI) [16] is 
calculated for simulated and observed precipitation (Fig. 6). The CCLM simulation 
is able to reproduce the temporal development of EDI very closely. The duration and 
intensity of extreme periods, like the long-term drought in 1976, are well matched. 
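
The text does not spell out how the EDI is computed. As a rough illustration, the sketch below follows a commonly used formulation of the index, which may differ in detail from the one in [16]: an "effective precipitation" that weights recent days more strongly than earlier ones, standardised by its mean and standard deviation. Two simplifications are assumptions of this sketch: the standardisation uses the whole series instead of a climatology per calendar day, and the toy example uses a 3-day window instead of the usual 365 days.

```python
from statistics import mean, pstdev


def effective_precipitation(precip, day, duration):
    """Sum over n = 1..duration of the mean precipitation of the n most recent days."""
    total, running = 0.0, 0.0
    for n in range(1, duration + 1):
        running += precip[day - (n - 1)]  # add the day that lies n-1 days back
        total += running / n
    return total


def edi_series(precip, duration=365):
    """Standardised effective precipitation for every day with a full window."""
    ep = [effective_precipitation(precip, day, duration)
          for day in range(duration - 1, len(precip))]
    mu, sigma = mean(ep), pstdev(ep)
    return [(value - mu) / sigma for value in ep]


# Invented daily precipitation (mm) and a short window, for illustration only.
toy_precip = [0, 0, 5, 0, 0, 0, 10, 0, 0, 0]
toy_edi = edi_series(toy_precip, duration=3)
print([round(v, 2) for v in toy_edi])
```

Negative values indicate conditions drier than the series average, positive values wetter conditions.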

From this basic analysis a test period for the sensitivity experiments has been 
derived. The concept of the experiments is to examine the uncertainty induced 
by the insufficient information on the soil moisture in connection with the long-term 
memory of the soil. With all other conditions fixed, the initial soil moisture 
conditions have been altered for different starting dates (September 1st, 1972, 
January 1st, 1973 and June 1st, 1973; blue lines in Fig. 6) to study the effect of 
initialisation in different seasons and under changing soil conditions. The soil conditions 
have been altered by up to 50 % towards wetter and drier conditions. 

Fig. 7 Monthly NRMSD of total precipitation, evapotranspiration and soil moisture in different 
soil layers (Data extracted from the experiment with reduced soil moisture by 50 % on January 1st, 
1973) 

Figure 7 shows the normalized root mean square difference (NRMSD) of the 
sensitivity experiment compared to the base simulation for different soil layers, 
precipitation and evapotranspiration in Central Europe. The time series of NRMSD 
for an experiment starting on January 1st, 1973 with soil moisture reduced by 50 % 
shows an initial relaxation of the disturbed soil moisture conditions towards the 
base simulation. The response time is much longer in deeper soil layers. But even 
after several years a memory of the starting conditions is visible. This is 
the case not just within the soil but also for precipitation and evapotranspiration. The 
differences are most pronounced in summer, when soil conditions are typically drier than 
in winter. 
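
A minimal sketch of the two quantities discussed here, the NRMSD and a relaxation time derived from it, is given below. The normalisation is not specified in the text; the sketch assumes division by the mean of the base simulation. The threshold that defines "relaxed" and the monthly values are likewise invented for illustration.

```python
from math import sqrt


def nrmsd(experiment, base):
    """Root mean square difference over grid points, divided by the base mean.

    The choice of normalisation is an assumption of this sketch.
    """
    n = len(base)
    rmsd = sqrt(sum((e - b) ** 2 for e, b in zip(experiment, base)) / n)
    return rmsd / (sum(base) / n)


def relaxation_month(nrmsd_series, threshold):
    """Index of the first month from which NRMSD stays below the threshold."""
    for k in range(len(nrmsd_series)):
        if all(value < threshold for value in nrmsd_series[k:]):
            return k
    return None  # the series never settles below the threshold


# Invented monthly NRMSD values for one soil layer after a perturbed start.
monthly = [0.5, 0.3, 0.12, 0.08, 0.11, 0.05, 0.04]
print(relaxation_month(monthly, threshold=0.1))
```

Requiring the NRMSD to stay below the threshold, rather than to fall below it once, accounts for the seasonal re-emergence of differences in summer.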

The experiments cover different European regions with different climatological 
characteristics, from the Iberian Peninsula to Scandinavia. The regions show 
different sensitivities with respect to the soil initialisation, especially in the deep 
soil (Fig. 8). In the uppermost layers (above 20-30 cm depth) all regions show a 
relaxation time within the typical duration of a spin-up simulation. But the lower soil 
layers need considerably more time to get to a state close to the base simulation. 
The longest relaxation periods were found in Scandinavia, where frozen soils may 
occur, followed by Eastern Europe with more continental conditions and the dry 
Mediterranean area. The westerly regions close to the Atlantic, like the British Isles 
and France, need somewhat shorter times. 












Fig. 8 Dependence on spin-up time of European regions and soil layers (Data extracted from the 
experiment with reduced soil moisture by 50 % on January 1st, 1973) 


To conclude: 

• The highest soil moisture sensitivities were found for dry conditions and when 
frozen soils are involved. 

• Under moist conditions, when the soil is close to saturation, the memory is 
shorter than for dry conditions. Therefore, a model initialisation in winter or 
other moist periods is preferable. 

• Even a spin-up period of several years might not be sufficient for climate models. 
A soil moisture initialisation based on long-term prior simulations might 
reduce the problem, but further studies and a better observational base are needed 
to overcome this initialisation problem. 


3.4 The CORDEX Framework 

Simulations with CCLM have been carried out for Africa within the CORDEX 
framework [2]. Africa was chosen as the first CORDEX target region because 
of its vulnerability to climate change in terms of impacts on temperature and 
precipitation patterns. The ERA-Interim [17] driven evaluation results, which used 
the required horizontal resolution of 0.44° (AFR44), have been briefly described 
in [1]. Besides this evaluation run, a simulation with a higher resolution of 0.22° 
(AFR22) has been performed, which also uses the direct forcing from the ERA-Interim 
reanalyses. A comparison of the results with climatological observations 
Fig. 9 Annual cycle of monthly means of temperature at 2 m height (upper row) and mean 
monthly sums of precipitation (lower row) for African land regions north (NEC, left) and 
south (SEC, right) of the Equator for the period 1991-2000; blue curves: results of AFR22 
CCLM simulation; green curves: results of AFR44 CCLM simulation; red curves: climatological 
observations (M. Buchner, PIK, Potsdam, pers. comm.) 


showed neither a significant improvement of the model results due to the finer 
resolution nor a significant deterioration (Fig. 9). In general, the higher resolution 
led to a slight increase of mean monthly temperatures by about 0.4 °C, and to 
slightly more precipitation, by about 40 mm/month. 


3.5 Implementation on CRAY XE6 

After some technical changes in the code, the RCM COSMO-CLM was 
successfully implemented on the CRAY XE6 at HLRS. With the help of scaling 
tests, a domain decomposition into 32 × 32 sub-domains (= 1,024 cores) 
was found to be appropriate for all simulation activities. Much effort was 
spent on finding compiler options that guarantee identical results when changing only 
the domain decomposition, and thus the number of cores. Table 1 summarizes the 
CRAY and INTEL compiler options with the highest optimisation level that fulfil this 
requirement. 

The next step was to study the impact of the change of the computing system 
(from NEC SX8 to CRAY XE6) on the model results. For this purpose the CORDEX 
Table 1 Recommended options to compile the RCM COSMO-CLM using the CRAY or INTEL 
programming environments on CRAY XE6 

Programming environment   Compiler options 
CRAY                      -c -O1 -O fp1 -eF 
INTEL                     -c -cpp -Os -no-vec 


Table 2 Performance of CCLM CORDEX Africa simulation on CRAY XE6 and NEC SX8. All values 
are valid for one simulation year; domain size of 0.44° simulation: 214 × 221 × 35; unit nh/year: 
node hours per simulation year 

Computing platform   No. of cores   No. of nodes   Wall-clock time (h)   CPU-time (nh/year) 
CRAY XE6             1,024          32             2.8                   90.1 
NEC SX8              16             2              8.8                   17.0 


Africa evaluation simulations have been repeated on the CRAY for the period 1989 
till 1996, using the identical forcing data as on the NEC SX8. Fortunately, the 
differences between the results are negligible (not shown). 

This test also gave the opportunity to compare the performance of the simulations 
on both computing systems. Table 2 summarizes the results of this comparison. 
Due to the large number of cores that can be used on the CRAY, the wall-clock time 
for a 1-year simulation is much shorter than on the NEC SX8. On the NEC SX8 there was 
no performance gain when using more than 16 cores (2 nodes). In contrast, the CPU 
time per simulation year increases considerably when using the CRAY. For this test 
all cores of a node were used. When using only half the number of cores per 
node on the CRAY without changing the total number of cores (thus doubling the 
number of nodes used), wall-clock time and CPU time could be reduced by about 
25 % due to optimized cache usage. 
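
The comparison in Table 2 can be retraced with a few lines of arithmetic. The sketch below assumes that node hours are simply the number of nodes multiplied by the wall-clock time; the small differences to the tabulated values (90.1 and 17.0 nh/year) presumably stem from rounding of the wall-clock times. All variable names are illustrative.

```python
def node_hours(nodes, wall_clock_hours):
    """Node hours consumed for one simulation year (assumed definition)."""
    return nodes * wall_clock_hours


# Values taken from Table 2.
cray_estimate = node_hours(32, 2.8)   # close to the tabulated 90.1 nh/year
nec_estimate = node_hours(2, 8.8)     # close to the tabulated 17.0 nh/year

# Ratios based on the tabulated values.
wall_clock_speedup = 8.8 / 2.8        # CRAY is about 3 times faster
node_hour_ratio = 90.1 / 17.0         # but uses about 5 times more node hours

print(cray_estimate, nec_estimate)
print(round(wall_clock_speedup, 2), round(node_hour_ratio, 2))
```

Node hours on a scalar and on a vector system are not directly comparable in terms of cost, so the second ratio should be read only as an indication of the trade-off described in the text.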


4 Future Work and HPC Demands 

Between the timescales of short-term and seasonal weather prediction on the 
one side, and long-term climate projections on the other side, there is a gap in 
the timescales of several years to several decades, for which climate predictions 
are not yet possible. On these timescales, weather and climate patterns depend 
not only on the anthropogenic rise of greenhouse gas concentrations, but also 
strongly on natural climate variability, induced by variations in the atmosphere and 
oceans (internal natural forcing), and by solar variability and volcanic eruptions 
(external natural forcing). The aim of the research program MIKLIP (Mittelfristige 
Klimaprognosen), which has been initiated by the Federal Ministry of Education 
and Research (BMBF), is to close this gap, and to develop a system for climate 
predictions for up to a decade ahead. The IMK contributes to MIKLIP in joint 
projects which aim at regional decadal predictions of climate for Central Europe 
and the West African monsoon region. 

In addition, the work within the CORDEX framework, the KLIWA erosion 
project, and the REKLIM research initiative will be continued. 

The new project KLIMOPASS continues the goals of the former project 
“Herausforderung Klimawandel”. In this project, new ensembles of high-resolution 
climate simulations (7 and 2.8 km), driven by new GCM data, will be established 
and analysed. New ensemble generating techniques, taken e.g. from the REKLIM 
project, will also be used. The projection period will include 2021-2050, and 
possibly also 2071-2100. 

All these activities place large demands on High Performance Computing 
resources. A rough estimate amounts to about 400,000 CPU-h for the next 2 years, 
and 160 TB of storage capacity, which have to be available online in order to perform 
the CCLM simulations and the subsequent analysis of the model results 
(post-processing). 
Setting Up Regional Climate Simulations 
for Southeast Asia 


Patrick Laux, Van Tan Phan, Christof Lorenz, Tran Thuc, Lars Ribbe, 
and Harald Kunstmann 


Abstract Climate change and climate variability are main drivers for land-use, 
especially for regions dominated by agriculture. Within the framework of the project 
Land-Use and Climate Change Interactions in Central Vietnam (LUCCi) regional 
climate simulations are performed for Southeast Asia in order to estimate future 
agricultural productivity and to derive adaptive land-use strategies for the future. 
The focal research area is the Vu Gia-Thu Bon (VGTB) river basin of Central Vietnam. 
To achieve the goals of this project reliable high resolution climate information for 
the region is required. Therefore, the regional non-hydrostatic Weather Research 
and Forecasting (WRF) model is used to dynamically downscale large-scale 
coupled atmosphere-ocean general circulation model (AOGCM) information. WRF 
will be driven by the ECHAM5-GCM data and the business-as-usual scenario 
A1B for the period 1960-2050. The focus of this paper is on the setup of WRF 
for East Asia. Prior to running the long-term climate simulation in operational 
mode, experimental simulations using different physical parameterizations have 
been conducted and analyzed. Different datasets have been used to drive the WRF 
model and to validate the model results. For the evaluation of the parameterization 
combinations, special emphasis is given to the representation of the spatial patterns of 
rainfall and temperature. In total, around 1.7 million CPUh are required to perform the 
climate simulations. The required computing resources have been approved by 
the Steinbuch Centre for Computing (KIT, SCC). 

1 Introduction 

Climate change and climate variability are of major concern for Central 
Vietnam’s environment and people’s well-being. The increasing frequency and severity of 
extreme events like floods, droughts, and hurricanes, but also increasing temperatures, 
sea level rise and salt water intrusion in the coastal areas, are expected to have 
dramatic consequences for agricultural productivity and thus food security in the 
VGTB basin. These challenges demand informed stakeholders and land 
management strategies that increase the resilience of the ecosystems. 

Vietnam’s lowlands and midlands are predominantly characterized by rice 
landscapes. Rice is the pillar of food security for many million households. Rice 
production is linked in complex ways with land management, water and environment. 
Judicious management of rice ecosystems is seen today - in a post-green revolution 
age - as a major strategy for raising rice productivity, protecting the environment 
and achieving long-term food security for rural and urban populations in Vietnam 
and all over rice-producing Asia. Many problems related to rice production and 
climate change/climate variability in Vietnam have become obvious during the last 
decades. Temperature stress, especially during sensitive rice development stages, 
negatively affects crop development and yields. Sea level rise and salt water 
intrusion in lowland coastal areas are affecting rice cropping, while the magnitude 
of the effects depends on the complex interactions between the cropping calendar and 
the hydrological situation. Although rice is a very salt resistant crop, salinity levels 
beyond threshold levels will eventually decrease yields. Excessively high water levels 
and prolonged inundation periods can severely affect yields. 

The major goal of the LUCCi project is to provide a sound future land use 
management framework that considers socio-economic development, population 
growth and expected impacts of climate change on land and water resources. This 
framework links climate change mitigation - through the reduction of GHG 
emissions - with adaptation strategies to secure food supply in a changing environment. 
As a basis, present and past land use practices and the use of water resources in 
the VGTB basin are analyzed with special emphasis on possible climate change 
impacts. This in-depth analysis will allow deriving carbon-optimized land and water 
use strategies for the VGTB basin as well as for the larger region of Central Vietnam. 

The Dynamical Downscaling (DDS), which is performed by KIT, IMK-IFU, 
will contribute to assessing future land use and quantifying agricultural productivity in 
the VGTB basin of Central Vietnam. This requires reliable and high resolution 
information about the climatology for the present and past, but also future regional 
climate projections. It is widely accepted that present-day General Circulation 
Models (GCMs) are able to simulate the large-scale state of the atmosphere in a 
realistic manner, and predict large-scale climate change based on assumptions about 
future greenhouse gas emissions (AR4, IPCC). Their results on regional and 
local scales, however, are inadequate, mainly due to the limited representation of 
mesoscale atmospheric processes, topography, and land-sea distribution in GCMs 
(e.g. [2, 14]). A direct application of GCM output for regional and local impact 
studies would lead to inconsistencies in frequency statistics, such as the occurrence 
probabilities of rainfall events (e.g. [7, 12]). Within the LUCCi framework, both 
Dynamical and Statistical Downscaling (SDS) approaches will be followed 
and combined in such a way that the advantages of each approach are exploited. 
The scarcity of observed climate data in this region and the most probably 
non-stationary climatic processes for the future period of interest (2010-2050) demand 
a DDS approach. Therefore, the regional non-hydrostatic Weather Research and 
Forecasting (WRF) model will be run in three nesting steps with resolutions 
of 45, 15, and 5 km to obtain transient and consistent climate simulations for the 
period 1960-2050. Due to the high computational demands, WRF will solely be 
driven by the coupled atmosphere-ocean general circulation model ECHAM5 and 
the business-as-usual A1B scenario. The underlying assumptions for the chosen 
scenario are a future world of very rapid economic growth, low population growth 
and rapid introduction of new and more efficient technology. In addition to the 
dynamical downscaling approach, a multi-model and multi-scenario ensemble- 
based SDS technique will be developed which allows for quantifying uncertainties 
inherent to the downscaling approaches, the GCMs and the chosen scenarios. 

The main goal of this paper is to identify a suitable setup of WRF physical 
parameterizations for Southeast Asia, which is the prerequisite for conducting long-term 
climate simulations. Based on the identified setup, the long-term climate simulations 
are conducted. Although a few regional climate simulation efforts have been made for 
Southeast Asia (e.g. [9]), to the best of our knowledge these simulations represent 
the first efforts (i) to identify the best physical parameterization combination in a 
systematic manner, and (ii) to conduct transient long-term climate simulations for 
the VGTB river basin with this detailed spatio-temporal resolution. 


2 The Regional Climate Model Simulations 
2.1 The Regional Climate Model WRF 

WRF is a next-generation mesoscale numerical weather prediction system designed 
to serve both operational forecasting and atmospheric research needs. The WRF 
Software Framework (WSF) provides the infrastructure that accommodates the 
dynamics solvers, physics packages that interface with the solvers, programs for 
initialization, WRF-Var, and WRF-Chem. There are two dynamics solvers in the 
WSF. The one applied in this project is the Advanced Research WRF (ARW) 
solver, which was primarily developed at NCAR (National Center for Atmospheric 
Research, USA). The ARW dynamics solver integrates the compressible, 
non-hydrostatic Euler equations. The equations are cast in flux form using variables 
that have conservation properties. The equations are formulated using a 
terrain-following mass vertical coordinate. The flux form equations in Cartesian space are 
extended to include the effects of moisture in the atmosphere and projections to the 
sphere. 

For the temporal model discretization the ARW solver uses a time-split integration 
scheme. Generally speaking, slow or low-frequency (meteorologically significant) 
modes are integrated using a third-order Runge-Kutta (RK3) time integration 
scheme, while the high-frequency acoustic modes are integrated over smaller time 
steps to maintain numerical stability. The horizontally propagating acoustic modes 
(including the external mode present in the mass-coordinate equations using a 
constant-pressure upper boundary condition) and gravity waves are integrated using 
a forward-backward time integration scheme, and vertically propagating acoustic 
modes and buoyancy oscillations are integrated using a vertically implicit scheme 
(using the acoustic time step). The time-split integration for the flux-form equations 
is described and analyzed in [5]. 
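
The text names the RK3 scheme without writing it out. The sketch below uses the three-stage form that is commonly given for the ARW solver (this form is an assumption here, as are all names); the acoustic sub-steps of the time-split scheme are deliberately omitted, so it only illustrates the large time step for the slow modes.

```python
import math


def rk3_step(phi, dt, tendency):
    """One three-stage Runge-Kutta step for d(phi)/dt = tendency(phi)."""
    phi_1 = phi + dt / 3.0 * tendency(phi)    # first stage: third of a step
    phi_2 = phi + dt / 2.0 * tendency(phi_1)  # second stage: half a step
    return phi + dt * tendency(phi_2)         # final stage: full step


def integrate(phi0, dt, n_steps, tendency):
    """Advance phi0 by n_steps large time steps of length dt."""
    phi = phi0
    for _ in range(n_steps):
        phi = rk3_step(phi, dt, tendency)
    return phi


def decay(phi):
    """Simple linear test problem d(phi)/dt = -phi (illustration only)."""
    return -phi


approx = integrate(1.0, 0.1, 10, decay)
exact = math.exp(-1.0)
print(approx, exact)
```

For the linear test problem each step reproduces the Taylor series of the exact solution up to the cubic term, which is what makes the scheme third-order accurate in this case.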

The spatial discretization in the ARW solver uses a C grid staggering for the 
variables. That is, normal velocities are staggered one-half grid length from the 
thermodynamic variables. The grid lengths Δx and Δy are constants in the model 
formulation; changes in the physical grid lengths associated with the various 
projections to the sphere are accounted for using map factors. The vertical grid 
length Δη is not a fixed constant; it is specified in the initialization. The user is 
free to specify the η values of the model levels subject to the constraint that η = 1 
at the surface, η = 0 at the model top, and η decreases monotonically between the 
surface and model top. 
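
The constraints on the vertical levels can be checked programmatically before a run. The following minimal sketch tests a user-supplied list of levels against the three conditions stated above; the function names and the example levels are hypothetical and are not taken from the project setup.

```python
def valid_eta_levels(eta, tol=1e-12):
    """True if eta starts at 1 (surface), ends at 0 (model top) and
    decreases strictly monotonically in between."""
    if len(eta) < 2:
        return False
    if abs(eta[0] - 1.0) > tol or abs(eta[-1]) > tol:
        return False
    return all(upper < lower for lower, upper in zip(eta, eta[1:]))


def layer_thicknesses(eta):
    """Vertical grid lengths between neighbouring levels (all positive
    for a valid set of levels)."""
    return [lower - upper for lower, upper in zip(eta, eta[1:])]


# Hypothetical coarse example with finer spacing near the surface.
example_levels = [1.0, 0.99, 0.95, 0.8, 0.5, 0.2, 0.0]
print(valid_eta_levels(example_levels))
print(layer_thicknesses(example_levels))
```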


2.2 WRF Domain Setup 

For the research project LUCCi climate simulations with a target resolution of 5 km 
for Central Vietnam for 1960-2050 shall be performed. For this purpose, WRF is 
nested in the general circulation model ECHAM5 using the following setup and 
location of the domains (Fig. 1). 


1. Domain 1: 

• Horizontal: 99 × 99 grid points with a resolution of 45 km 

• Vertical: 50 layers up to 5,000 Pa 

• Time step: 180 s (adaptive time-step option enabled) 

2. Domain 2: 

• Horizontal: 142 × 145 grid points with a resolution of 15 km 

• Vertical: 50 layers up to 5,000 Pa 

• Time step: 180 s (adaptive time-step option enabled) 

3. Domain 3: 

• Horizontal: 66 × 75 grid points with a resolution of 5 km 

• Vertical: 50 layers up to 5,000 Pa 

• Time step: 180 s (adaptive time-step option enabled) 

Fig. 1 Domains to be modelled by the regional climate model WRF using nesting strategy 


2.3 Required HPC Resources 

Due to the limitations of the CFL stability condition, the adaptive time-step option 
has been enabled, so that the number of required integration steps cannot be determined 
in advance. Calculating with the predefined time step of 180 s, more than 15 million 
integration steps are required on 490,050 grid cells for Domain 1, 1,029,500 
grid cells for Domain 2, and 247,500 grid cells for Domain 3, respectively, with more than 
10 degrees of freedom on each grid cell (momentum, mass, pressure and various 
mixing ratios for moisture like water vapor, cloud water, ice water, rainwater, 
snow etc.). The resulting size of this task makes it necessary to move to a suitable 
high performance computing environment. 

Fig. 2 Performance of WRF on HPC XC4000 in Karlsruhe 
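
The grid sizes and the number of integration steps quoted above follow directly from the domain setup. The sketch below retraces them, assuming the full period 1960-2050 and a fixed time step of 180 s (the adaptive time step will change the actual count); the dictionary layout and names are illustrative.

```python
# (nx, ny, vertical layers, horizontal resolution in km) per domain
DOMAINS = {
    "Domain 1": (99, 99, 50, 45),
    "Domain 2": (142, 145, 50, 15),
    "Domain 3": (66, 75, 50, 5),
}

grid_cells = {name: nx * ny * nz for name, (nx, ny, nz, _) in DOMAINS.items()}

# Nesting ratio between consecutive domains (45 -> 15 -> 5 km).
resolutions = [res for (_, _, _, res) in DOMAINS.values()]
nest_ratios = [coarse // fine for coarse, fine in zip(resolutions, resolutions[1:])]

# Number of fixed 180 s steps for 1960-2050, using 365.25 days per year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
integration_steps = (2050 - 1960) * SECONDS_PER_YEAR / 180

print(grid_cells)
print(nest_ratios)
print(integration_steps)
```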

Software development, testing, benchmarking, and required preprocessing are 
performed on the KIT, IMK-IFU Linux cluster. KIT, IMK-IFU is managing an 
InfiniBand-based Linux cluster with 116 Opteron, 96 Istan, and 192 Magny 
processors. 

The preprocessing is performed on an annual basis. The generated files, which are 
required to drive WRF, are then transferred to the HPC environment XC4000 via 
scp. Based on the performed preprocessing, WRF is also run in annual time slices; 
however, the WRF restart option enables transient climate simulations, i.e. without 
re-initializing WRF every year. WRF and other required software such as NetCDF have 
been successfully installed on the cluster architecture HPC XC4000 using a test 
account. WRF test runs have been performed using shared memory parallelism 
(OpenMP). 

It is found that 128 processors (32 CPU cores) show the best performance (see 
Fig. 2). Computing time for the three domains for 1 month was approximately 8 h, 
which results in 8 × 128 = 1,024 CPUh. 

In order to derive the signal expected from future climate change and climate 
variability, the future climate projection must be compared against the control run. 
Besides that, a climate simulation driven by ERA40 reanalysis will be performed to 
assess the quality of the control run. The simulation efforts can be subdivided into 
three blocks of 41 years of simulation time plus 4 years of spin-up time. 

1. Climate simulations of the control period (1960-2001) 

2. Climate simulations for the future scenario A1B (2001-2050) 

3. ERA40 reanalysis simulations of the control period (1960-2001) 

This means that for each block of 45 years 45 × 12 × 1,024 = 552,960 CPUh are 
required, which we extend to 553,000 CPUh to have some additional capacity. 
For the proposed climate simulations for LUCCi, in total 3 × 553,000 = 
1,659,000 CPUh are required for the ERA40 reanalysis and control plus the future 
climate simulations (A1B). With the proposed 128 CPUs this means approximately 
180 days of pure computing time per 45 years of simulation. Thus, we estimate around 
8 months total simulation time per block (6 months pure computation time plus 
queue/waiting time), which results in a total time of 2 years for the proposal. 
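
The resource estimate above can be reproduced step by step. The sketch below uses only the numbers given in the text (8 h per simulated month on 128 processors, three blocks of 41 + 4 years); the variable names are illustrative.

```python
PROCESSORS = 128
WALL_HOURS_PER_MONTH = 8       # measured for all three domains
BLOCK_YEARS = 41 + 4           # simulation time plus spin-up
BLOCKS = 3                     # control, A1B scenario, ERA40 reanalysis

cpuh_per_month = WALL_HOURS_PER_MONTH * PROCESSORS
cpuh_per_block = BLOCK_YEARS * 12 * cpuh_per_month
cpuh_per_block_rounded = 553_000          # rounded up as in the text
total_cpuh = BLOCKS * cpuh_per_block_rounded

# Pure computing time per block when all 128 processors are used.
wall_hours_per_block = cpuh_per_block / PROCESSORS
wall_days_per_block = wall_hours_per_block / 24

print(cpuh_per_month, cpuh_per_block, total_cpuh)
print(wall_days_per_block)
```

The 180 days of pure computing time correspond to the 6 months quoted per block; the remaining 2 months are the allowance for queueing.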

For the intermediate storage of the required input data and the WRF output of 
a short time period we estimate a storage capacity of 2 TB permanent and 2 TB 
temporary disc space at HP XC2. The results will be transferred to KIT, IMK-IFU. 
To permanently archive the simulation results a storage capacity of about 30 TB is 
required. 


3 Experimental Parameterization Combinations of WRF 

Because long-term WRF simulations for Southeast Asia are performed for 
the first time by KIT, IMK-IFU, running the climate simulations on the HPC 
XC4000 requires several preparatory steps. These mainly include test simulations 
with WRF in order to find an optimal setup of the physical parameterizations, i.e. 
the parameterizations for the microphysics, planetary boundary layer, and cumulus 
convection. Please note that the cumulus convection scheme is switched off for the 
5 km resolution (here: Domain 3). The different parameter combinations applied for 
the year 2000 are shown in Table 1. 

The experimental climate simulations are performed for the year 2000 and 
validated using gridded observation data for rainfall and temperature. The validation 
of simulated precipitation fields has been performed using Asian Precipitation - 
Highly-Resolved Observational Data Integration Towards Evaluation of the Water 
Resources in 0.25° resolution [15], hereinafter referred to as Aphrodite data, and for 
temperature, CRU TS 2.1 data in 0.5° resolution [8], referred to as CRU data, have 
been used. As boundary conditions for the RCM, both NCEP/NCAR [3,4] as well as 
ERA40 [13] reanalysis data have been applied. As the primary objective of the regional 
climate simulations is to match the spatial patterns of precipitation and temperature, 
the results of the climate simulations of Domain 2 are validated, because they 
match best with the resolution of the gridded observations. Nevertheless, the climate 
simulation outputs are regridded to the resolution of the gridded observations, i.e. 0.25° 
and 0.5° for Aphrodite and CRU, respectively. In the following section the climate 
simulation results of Domain 2 driven by the NCEP/NCAR and ERA40 reanalysis 
data are validated against the gridded observation data for precipitation and 
temperature for the year 2000. Taylor diagrams 
provide a visual framework to compare the simulated (WRF) results against a 
reference (here: the gridded precipitation product Aphrodite). 
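
A Taylor diagram combines three statistics of a simulated and a reference pattern: the standard deviations, the correlation and the centered RMS difference. The following minimal sketch computes them for two fields flattened to lists and checks the law-of-cosines relation that allows all three to be drawn in one diagram; the function name and the sample values are hypothetical, and area weighting of grid cells is ignored.

```python
import math


def taylor_statistics(model, reference):
    """Return (std_model, std_reference, correlation, centered_rms_difference)."""
    if len(model) != len(reference) or len(reference) < 2:
        raise ValueError("fields must have equal length and at least two values")
    n = len(reference)
    mean_m = sum(model) / n
    mean_r = sum(reference) / n
    anom_m = [m - mean_m for m in model]
    anom_r = [r - mean_r for r in reference]
    std_m = math.sqrt(sum(a * a for a in anom_m) / n)
    std_r = math.sqrt(sum(a * a for a in anom_r) / n)
    if std_m == 0.0 or std_r == 0.0:
        raise ValueError("correlation is undefined for a constant field")
    corr = sum(a * b for a, b in zip(anom_m, anom_r)) / (n * std_m * std_r)
    crmsd = math.sqrt(sum((a - b) ** 2 for a, b in zip(anom_m, anom_r)) / n)
    return std_m, std_r, corr, crmsd


# Hypothetical precipitation values on five grid cells.
aphrodite = [2.0, 5.0, 9.0, 4.0, 7.0]
wrf_run = [3.0, 4.0, 10.0, 6.0, 6.0]
std_m, std_r, corr, crmsd = taylor_statistics(wrf_run, aphrodite)

# Geometric relation behind the diagram.
law_of_cosines = std_m ** 2 + std_r ** 2 - 2.0 * std_m * std_r * corr
print(round(crmsd ** 2, 6), round(law_of_cosines, 6))
```

Because the mean is removed before the comparison, a Taylor diagram says nothing about the overall bias, which is why the temperature bias is discussed separately.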
Table 1 Physical parameterization combinations of the different WRF 
experimental runs. Microphysics: 2 - Lin, 3 - WRF Single-Moment-3-class, 
13 - Stony Brook University; Planetary Boundary Layer (PBL): 
1 - Yonsei University, 5 - Mellor-Yamada Nakanishi and Niino Level 2.5 
PBL; Cumulus: 2 - Betts-Miller-Janjic, 14 - New Simplified Arakawa-Schubert 

Run   Microphysics (D1-D3)   PBL (D1-D3)   Cumulus (D1-D2) 
B     2                      1             2 
C     2                      5             2 
D     2                      5             14 
E     2                      1             14 
F     3                      1             2 
G     3                      5             2 
H     3                      1             14 
I     3                      5             14 
J     13                     1             2 
K     13                     5             2 
L     13                     5             14 
M     13                     1             14 
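
The combinations of Table 1 map onto the physics options of the WRF namelist. The sketch below generates the corresponding fragment for one run, assuming the standard `&physics` variables `mp_physics`, `bl_pbl_physics` and `cu_physics` with one value per domain, and setting the cumulus scheme to 0 for Domain 3 as stated in the text. The helper name is hypothetical, and a complete namelist.input requires many more entries than shown here.

```python
# Run code -> (microphysics, PBL, cumulus), transcribed from Table 1.
COMBINATIONS = {
    "B": (2, 1, 2), "C": (2, 5, 2), "D": (2, 5, 14), "E": (2, 1, 14),
    "F": (3, 1, 2), "G": (3, 5, 2), "H": (3, 1, 14), "I": (3, 5, 14),
    "J": (13, 1, 2), "K": (13, 5, 2), "L": (13, 5, 14), "M": (13, 1, 14),
}


def physics_namelist(run):
    """Build a minimal &physics fragment for the three nested domains."""
    mp, pbl, cu = COMBINATIONS[run]
    lines = [
        "&physics",
        f" mp_physics     = {mp}, {mp}, {mp},",
        f" bl_pbl_physics = {pbl}, {pbl}, {pbl},",
        # Cumulus scheme only on Domains 1 and 2; switched off at 5 km.
        f" cu_physics     = {cu}, {cu}, 0,",
        "/",
    ]
    return "\n".join(lines)


print(physics_namelist("G"))
```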


4 Validation Results of WRF Simulations Driven 
by NCEP/NCAR and ERA40 

Figure 3 shows the Taylor diagrams for precipitation (WRF-NCEP/NCAR 
compared to Aphrodite) obtained for the different physical parameterization setups. In part, 
large differences between the setups are found for precipitation. The 
spatial correlation patterns strongly depend on the season: the correlation is higher 
for winter (DJF) and fall (SON), the periods in which most of the rain falls in 
Southeast Asia. The correlation coefficient is lowest for summer (JJA), the season 
in which convective rainfall dominates. In this season the standard deviation as well 
as the RMS differences are greatest. 

In general, the spatial correlations for temperature between NCEP/NCAR 
reanalysis driven WRF simulations and observations (CRU) are high. They are of 
the order of 0.8, and temperature is not very sensitive to the choice of physical 
parameterization (not shown here). 

Comparing the ERA40 driven simulation results, illustrated in Fig. 4, with the 
results obtained using NCEP/NCAR reanalysis data, one can clearly observe a better 
representation of the spatial correlations for the summer season, while the skill for 
the fall and winter seasons is reduced. The deviations of single WRF experiments 
from the observations, such as B, C, J, and K (see Table 2), expressed as 
RMS differences and standard deviations, are drastically increased. 

However, the temperature bias is reduced compared to the NCEP/NCAR driven 
WRF simulation results. 
Fig. 3 Taylor diagram of NCEP/NCAR driven WRF simulations and observed (Aphrodite) 
precipitation amounts for (a) winter 2000 (DJF), (b) spring 2000 (MAM), (c) summer 2000 (JJA), 
(d) fall 2000 (SON), and (e) the whole year of 2000. The observation data (here: Aphrodite) is 
shown as A, the coding for the parameter combinations can be obtained from Table 1. The similarity 
of two patterns is quantified in terms of their correlation (blue), their root-mean-square (RMS) 
difference (green), and their standard deviation (black) 
Fig. 4 Taylor diagram of ERA40 driven WRF simulations and observed (Aphrodite) precipitation 
amounts for (a) winter 2000 (DJF), (b) spring 2000 (MAM), (c) summer 2000 (JJA), (d) fall 2000 
(SON), and (e) the whole year of 2000 













Table 2 Bias of mean temperatures between NCEP/NCAR reanalysis driven WRF simulations and gridded observations CRU for winter, spring, summer, and 
fall of the year 2000 and the whole Domain 2. The observation data (here: CRU) is shown in A; the coding for the parameter combinations can be obtained 
from Table 1. Mean: mean value; σ: standard deviation; RMSc: centered RMS difference; r: correlation 

     Winter                     Spring                     Summer                     Fall 
Run  Mean   σ     RMSc  r       Mean   σ     RMSc  r       Mean   σ     RMSc  r       Mean   σ     RMSc  r 
A    28.22  5.19  -     -       25.63  2.82  -     -       26.71  1.87  -     -       24.71  2.24  -     - 
B    13.01  6.67  4.33  0.76    22.95  3.10  1.87  0.81    24.99  3.00  1.94  0.78    19.11  3.99  2.38  0.85 
C    11.11  7.29  4.77  0.76    22.38  3.05  2.07  0.75    24.66  3.21  2.09  0.79    19.28  3.99  2.39  0.85 
D    13.31  6.95  4.22  0.80    22.50  3.16  1.94  0.80    24.76  3.11  2.07  0.76    18.30  4.34  2.83  0.81 
E    14.66  6.89  4.01  0.81    20.11  3.16  1.98  0.79    20.91  3.08  2.21  0.70    19.44  3.61  2.05  0.86 
F    10.46  7.18  5.11  0.70    22.07  3.12  1.93  0.79    24.78  3.02  1.94  0.78    17.61  4.39  2.89  0.81 
G    14.67  6.84  3.93  0.82    22.28  3.21  2.07  0.77    22.56  3.05  2.05  0.75    18.53  4.08  2.52  0.84 
H    14.93  6.54  3.69  0.83    20.37  3.21  1.91  0.81    17.81  3.16  2.28  0.70    16.75  3.83  2.38  0.82 
I    13.41  6.86  4.14  0.80    19.88  3.33  2.08  0.78    17.79  3.43  2.54  0.69    16.20  4.29  2.66  0.85 
J    12.18  6.56  4.33  0.75    22.70  3.13  1.89  0.80    25.69  3.20  2.08  0.79    19.64  3.82  2.22  0.86 
K    13.96  7.06  4.47  0.78    22.82  3.07  2.00  0.77    24.04  3.08  2.00  0.78    17.61  4.25  2.77  0.81 
L    13.45  6.89  4.20  0.79    22.33  3.18  1.92  0.80    24.44  3.15  2.08  0.77    18.48  4.15  2.63  0.82 
M    14.54  6.99  4.13  0.81    20.30  3.31  1.95  0.81    23.60  3.38  2.36  0.74    16.03  4.14  2.67  0.81 
Table 3 ERA40-CRU (winter, spring, summer, and fall). The observation data (here: CRU) is shown in A; the coding for the parameter combinations can be 
obtained from Table 1. Mean: mean value; σ: standard deviation; RMSc: centered RMS difference; r: correlation 

     Winter                     Spring                     Summer                     Fall 
Run  Mean   σ     RMSc  r       Mean   σ     RMSc  r       Mean   σ     RMSc  r       Mean   σ     RMSc  r 
A    28.22  5.19  -     -       25.63  2.82  -     -       26.71  1.87  -     -       24.71  2.24  -     - 
B    17.89  6.83  3.37  0.88    23.12  2.95  1.66  0.83    24.76  2.24  1.38  0.79    21.24  3.08  1.60  0.87 
C    17.16  6.71  3.41  0.87    23.07  2.90  1.67  0.83    24.67  2.23  1.38  0.79    20.72  3.14  1.67  0.86 
D    17.34  6.61  3.27  0.87    22.88  2.95  1.62  0.84    24.16  2.40  1.41  0.81    21.06  3.30  1.73  0.87 
E    18.17  6.82  3.29  0.88    22.31  3.12  1.79  0.82    24.96  2.47  1.37  0.84    20.89  3.36  1.81  0.87 
F    17.38  6.89  3.59  0.86    23.24  2.96  1.66  0.84    24.81  2.32  1.36  0.81    21.10  3.25  1.73  0.87 
G    17.18  6.62  3.31  0.87    22.94  2.94  1.69  0.83    24.80  2.24  1.31  0.81    20.60  3.17  1.70  0.86 
H    17.31  6.77  3.52  0.86    23.20  3.05  1.54  0.86    26.99  3.00  1.72  0.85    21.92  3.40  1.76  0.89 
I    17.44  6.52  3.16  0.88    24.05  3.18  1.61  0.86    26.02  2.75  1.52  0.85    21.44  3.34  1.73  0.88 
J    18.10  6.80  3.31  0.88    23.39  2.96  1.65  0.84    24.88  2.16  1.35  0.78    21.37  3.06  1.57  0.87 
K    17.59  6.67  3.28  0.88    23.15  2.87  1.68  0.83    24.73  2.15  1.35  0.78    20.85  3.08  1.61  0.86 
L    17.48  6.74  3.40  0.87    23.16  3.12  1.87  0.80    24.38  2.17  1.28  0.81    18.96  3.35  2.06  0.80 
M    17.44  6.52  3.16  0.88    24.05  3.18  1.61  0.86    26.02  2.75  1.52  0.85    21.44  3.34  1.73  0.88 
Amongst all tested parameter combinations it is found that combination G (see 
Table 1), i.e. WRF Single-Moment-3-class, Mellor-Yamada Nakanishi and 
Niino Level 2.5 PBL, and Betts-Miller-Janjic, leads to reasonable results 
for both precipitation and temperature values. Due to the lower biases in 
temperature (especially during the winter months), the ERA40 
reanalysis is chosen as the driving dataset. 
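
The lower temperature bias of the ERA40 driven runs can be read off the seasonal means in Tables 2 and 3. The sketch below does this for combination G, taking the bias as the simulated mean minus the CRU mean (row A); the dictionary and function names are illustrative.

```python
# Seasonal mean temperatures transcribed from Tables 2 and 3.
CRU_MEAN = {"winter": 28.22, "spring": 25.63, "summer": 26.71, "fall": 24.71}
G_NCEP = {"winter": 14.67, "spring": 22.28, "summer": 22.56, "fall": 18.53}
G_ERA40 = {"winter": 17.18, "spring": 22.94, "summer": 24.80, "fall": 20.60}


def seasonal_bias(model, observation):
    """Simulated minus observed seasonal mean, rounded to two decimals."""
    return {s: round(model[s] - observation[s], 2) for s in observation}


bias_ncep = seasonal_bias(G_NCEP, CRU_MEAN)
bias_era40 = seasonal_bias(G_ERA40, CRU_MEAN)

for season in CRU_MEAN:
    print(season, bias_ncep[season], bias_era40[season])
```

With these values the ERA40 driven run is closer to CRU than the NCEP/NCAR driven run in every season, although a pronounced cold bias remains in winter and fall.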


5 Analysis of the Temperature Biases 

Albeit the spatial patterns for temperature between observed and simulated temper¬ 
atures are matched well, the large biases (Table 2 and Table 3) for the winter and 
fall season require further investigations. As all parameterization combinations are 
leading to underestimated temperatures of the WRF simulations within Domain 2, 
it has been speculated that the CRU data might overestimate temperature for this 
region. Figure 5 shows the differences between the gridded observation datasets 
CRU [8], DEL [6], and GLDAS [10] and the ERA Interim Reanalysis [1,11] for 
winter of the year 2000. Comparing CRU with alternative gridded observation 
datasets it is observed that CRU significantly overestimates the temperature during 
winter. Especially for Vietnam, this overestimation is obvious. Figure 6 illustrates 
the differences of the different datasets averaged over the grid cells corresponding 
to Vietnam. It is speculated that the overestimated temperatures of the CRU data 
result from the coarse network of observation stations within this topographically 
complex region. These observations are interpolated based on the surrounding eight 
stations. A first inspection of the CRU-derived interpolated elevation model, based
on the heights of the eight surrounding observation stations, shows partly strong
deviations from the elevation model used for the WRF model in Domain 2
(not shown). As a consequence, the temperature estimates of the CRU grid cells can 
deviate significantly from model results. Comparing CRU with other observation 
datasets in regions with similar complex terrain and low observation densities could 
help to judge the quality of the CRU dataset. However, it remains unclear why this
deviation is stronger during winter and fall than during the other seasons. A more 
detailed analysis is required. 


Fig. 5 Mean temperature (°C) for January, February, and December (2000) for the datasets CRU,
DEL, ERA Interim, and GLDAS

Fig. 6 Time series of monthly temperature for Vietnam (average values for the grid cells
corresponding to Vietnam) using the different datasets CRU, DEL, ERA Interim and GLDAS


6 Ongoing Activities and Future Work

The operational long-term climate simulations are performed in monthly time
slices using the restart option of WRF. This means that the results of the last time
step of the previous month are temporarily stored and used as input for the
subsequent month. Batch processing will be applied so that the climate simulations
run continuously. The model output will be stored at 6-hourly resolution.
Many surface climate variables are additionally stored at hourly resolution to meet
specific requirements of the LUCCi project consortium. Besides the instantaneous
variables, retained at 6-hourly and hourly resolution, additional meteorological
surface variables containing the magnitude and timing of the actual daily minimum
and maximum values are retained. This will allow for a detailed analysis of
extreme events later on.
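
The chaining of monthly time slices described above can be sketched as follows. This is only an illustration in Python, not the project's job script: the function name `monthly_slices` and the example period are assumptions. Each slice would be run as a WRF restart from the state stored at the end of the previous slice.

```python
from datetime import date


def monthly_slices(start_year, start_month, n_months):
    """Return (start, end) dates of consecutive monthly time slices.

    The end of one slice equals the start of the next one, i.e. the
    instant at which the restart state is written and read again.
    """
    slices = []
    year, month = start_year, start_month
    for _ in range(n_months):
        begin = date(year, month, 1)
        # advance by one calendar month
        month += 1
        if month > 12:
            month, year = 1, year + 1
        slices.append((begin, date(year, month, 1)))
    return slices


# hypothetical example period: three slices starting in December 2000
for begin, end in monthly_slices(2000, 12, 3):
    n_records = (end - begin).days * 4  # number of 6-hourly output records
    print(begin, end, n_records)
```

A batch system would submit one job per slice, each depending on the successful completion of the previous one.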

It is expected that the climate simulation will be finalized by the end of 2012.
The results will be provided to the LUCCi project consortium and, on request,
to the climate change and climate impact community.


The Agulhas System as a Key Region
of the Global Oceanic Circulation 


J.Y. Durgadoo and A. Biastoch 


1 Introduction 

The oceans around southern Africa form a unique system, impacting the regional 
and global climate [1]. From the Indian Ocean to the Atlantic Ocean, a vigorous
interoceanic exchange of warm and saline waters takes place that is subject to
a complicated interplay between local dynamics and global embedment. The central
element of the circulation around South Africa is the Agulhas Current [2], which flows
poleward along the east coast, closely bound to the shelf at first, and subsequently 
overshoots the southern tip of Africa to abruptly turn back into the Indian Ocean. 
Part of the warm and saline waters with tropical Indian Ocean origin, the “Agulhas 
leakage” [3], flows into the Atlantic and forms the surface return flow of the global 
thermohaline circulation towards the North Atlantic [4]. The exchange takes place
in a highly nonlinear manner, with mesoscale eddies being separated from the 
retroflecting Agulhas Current, which then strongly interact in the Cape Basin [5]. 
West of the Cape Basin, large Agulhas rings that have been formed [6] transport the 
anomalous warm and saline waters into the South Atlantic. In addition to its own 
dynamics, the Agulhas Current system is influenced by nonlinearities in the source 
regions: mesoscale eddies originating from the Mozambique Channel and east of 
Madagascar [7, 8] drift southward and cause the Agulhas Current to be displaced 
offshore of its mean position by more than 100 km. These solitary meanders (a.k.a. 
“Natal Pulses”) [9] rapidly propagate downstream triggering the timing of Agulhas 
rings [10,11]. 

Owing to strong eddy-mean flow interactions, the Agulhas leakage is a highly
intermittent process. The dynamic imprint of the Agulhas mesoscale variability
crosses the South Atlantic via baroclinic Rossby waves and is then rapidly
communicated via topographic waves along the South American shelf into the
North Atlantic, an effect that is visible in the Atlantic Meridional Overturning
Circulation (AMOC) [12]. In contrast, the advection of the anomalous water mass
characteristics takes longer [13]. Hindcast experiments covering the past
40 years have demonstrated that the volume of Agulhas leakage is increasing
due to changing wind systems in the Indian and Southern oceans [14,15]. This has
an important consequence, since the additional salt imported into the Atlantic
will find its way north and has the potential to stabilize a declining
Gulf Stream system [16].

Despite its global importance, the dynamics of the Agulhas system are not yet fully
understood and, owing to the strong nonlinear interactions and temporal variability,
are far from being properly quantified. The notorious undersampling of the oceans
around South Africa calls for dedicated studies within ocean general circulation 
models and emphasizes the importance of a reasonable representation of the current 
system in both modern and future coupled-climate models. 


2 INALT01 

The scarcity of continuous direct observations presents a general challenge 
to oceanographic research and this is particularly true in the Agulhas region. 
Consequently a modelling approach remains the most useful way to study this 
important area of the world’s ocean. The model used is based on the NEMO
(Nucleus for European Modelling of the Ocean, v3.1.1, [17]) code within
the DRAKKAR [18] framework. NEMO solves the primitive equations, which are
derived from the Navier-Stokes equations, together with a nonlinear equation of
state that couples temperature and salinity to the velocity field. Assumptions
are made (e.g. incompressibility of the flow, hydrostatic hypothesis, Boussinesq
fluid, among others) based on scale analysis. Being an ocean/sea-ice only model,
boundary conditions at the surface need to be specified. This is achieved through
bulk formulae, where the pseudo-atmosphere and ocean exchange the necessary
horizontal momentum and heat to drive the ocean. The sea-ice component employs
the LIM2 two-level thermodynamic-dynamic model [19]. The configurations of NEMO
used in this project (ORCA05, AG01 and INALT01) use a tri-polar horizontal grid
(poles over Antarctica, Canada and Russia) with variables arranged on an Arakawa
C-grid. In the vertical, 46 z-levels are used, with partial cells for a better
representation of bottom flow.

The high-resolution AG01 model [11,12,14,20] has been used extensively over the
last years and has proven a success (viz. Sect. 3). However, it does have some
limitations. The western extent of the refinement of the circulation only spans
half of the South Atlantic basin. Because of this restriction, it is impossible,
within this framework, to clearly distinguish the far-reaching impact of the
Agulhas Leakage, a task that is of key relevance and has become a subject of
active research. Furthermore, in addition to the limited time period of the
hindcast simulation (which is determined by the available atmospheric fields),
AG01 is based on a previous version of the numerical code. In order to address
these limitations effectively, INALT01 (Fig. 1) was conceived. INALT01,
building on the experience with AG01, provides a high-resolution nest over the
entire South Atlantic basin and the tropical Atlantic, with a hindcast
simulation spanning six decades.

Fig. 1 Snapshot from INALT01 showing temperature (°C, color-scheme) and currents (gradients)
at 450 m. The INALT01 configuration consists of an ocean/sea-ice global coarse resolution (0.5°)
base model with a nest model embedded within, refining the horizontal grid (0.1°) over the greater
Agulhas Current region, the South and Tropical Atlantic Ocean

INALT01 comprises two ocean models that are nested together: a global (base)
model that represents the major oceanic components of the circulation with a nominal
horizontal resolution of 50 km, and a refined (nest) model that regionally enhances
the horizontal resolution to 10 km (Fig. 2). The latter model is crucial for the correct
representation of the Agulhas dynamics, and being embedded within the former
offers the possibility to diagnose any impact the Agulhas system may have
on regions of the ocean where climatically important changes are currently being
observed.

Fig. 2 Bathymetry (colorbar in meters) illustrating the refinement of the nest over the greater
Agulhas region and South Atlantic compared to the global base host

Running such a model configuration is computationally extremely demanding.
With over 62 million combined grid points, the base and nest models of INALT01
run concurrently by decomposing the respective domains horizontally onto 16
processors. Including post-processing, a model year requires a computation time
of approximately 12-15 h on the NEC SX-9 at HLRS, depending on the time
resolution of the output fields. A typical model year with 5-daily output fields
generates 150 GB of raw data; a full 60-year simulation requires up to 9 TB. The
high spatial and temporal resolutions are essential for a correct representation
of the range of dynamics of the Agulhas Current system.
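
The resource figures quoted above can be cross-checked with a few lines of arithmetic. The following Python sketch is only illustrative: the function name `hindcast_cost` is an assumption, and decimal units (1 TB = 1000 GB) are assumed for the conversion.

```python
def hindcast_cost(years, hours_per_model_year, gb_per_model_year):
    """Return ((min_hours, max_hours), terabytes) for `years` model years."""
    low, high = hours_per_model_year
    total_hours = (low * years, high * years)
    # assumption: decimal units, 1 TB = 1000 GB
    total_tb = gb_per_model_year * years / 1000.0
    return total_hours, total_tb


# values taken from the text: 60 years, 12-15 h and 150 GB per model year
hours, terabytes = hindcast_cost(60, (12, 15), 150)
print(hours, terabytes)
```

With these inputs, the full hindcast amounts to several hundred hours of computation, and the storage estimate agrees with the 9 TB stated in the text.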


3 Results 

3.1 On the Discontinuous Nature of the Mozambique Current 

The concept of a spatially continuous western boundary current in the Mozambique 
Channel has historically been based on erroneous interpretations of ships’ drift. 
Recent observations have demonstrated that the circulation in the Channel is instead 
dominated by anti-cyclonic eddies drifting poleward. It has therefore been suggested 
that no coherent Mozambique Current exists at any time. However, satellite and 
other observations indicate that a continuous current, not necessarily an inherent
part of Mozambique eddies, may at times be found along the full Mozambican
shelf break. Using the high-resolution AG01 it has been demonstrated how such
a feature may come about [21]. In the model, a continuous current is a highly
irregular event, occurring about once per year, with an average duration
of only 9 days and with a vertical extent of about 800 m (Fig. 3). Surface speeds
may vary from 0.5 to 1.5 m/s and the volume flux involved is about 10 Sv. The 
continuous current may occasionally be important for the transport of biota along 
the continental shelf and slope. 

Fig. 3 Simulated velocities in the Mozambique Channel at a depth of 93 m in the model for
(a) 07 February 1994 and (b) 25 October 1993. Panel (b) shows the normal current configuration in
which three strong anti-cyclonic eddies are evident, all heading poleward. In contrast, (a) presents
the more unusual case of a continuous current along the full length of the Mozambican shelf; this
current on this occasion stayed intact for about 5 days. Arrows indicate current directions [21]



Fig. 4 Wind shift and Agulhas leakage response. (a) Ensemble mean of zonally averaged
wind stress (zonal component only, in N m⁻²) in the climate model under present-day (black line) and
global warming (blue) conditions. The difference function (thick) was applied to the sensitivity
experiments. (b) Annual Agulhas leakage for reference (dark bars) and sensitivity (light bars)
experiments


3.2 Anthropogenic Impact on Agulhas Leakage 

Climate model projections for the 21st century predict a progressive southward
migration of the Southern Hemisphere westerlies, associated with a southward shift
in the latitude of zero wind stress curl [22]. The potential effects on the ocean
circulation of such an anthropogenic trend in wind stress are studied here with a
sequence of experiments with the high-resolution ocean model. The model suggests
an increase of 5 Sv in Agulhas leakage (Fig. 4) in response to a 2° shift in the
zero wind stress curl line, associated with a southward expansion and spin-up of
the subtropical supergyre. The integral northward transport in the upper branch of
the AMOC gradually increases by up to 1.5 Sv at 20-25°S; this dynamic signal
extends north across the equator and fades out in the subtropical North Atlantic.
A main effect of the increasing inflow of Indian Ocean waters is the salinification of
upper-thermocline waters in the South Atlantic, which extends into the North Brazil
Current regime within just one to two decades.

Fig. 5 Wind stress patterns (left panels) over the Atlantic and Indian Ocean sectors of the Southern
Hemisphere and their corresponding zonal averages (right panels). Examples of sensitivity cases
are shown whereby the reference winds (top panel) are independently nudged towards an increase
(lower top panel) or a shift (upper bottom panel) of the Westerly winds or an increase of the Trade
winds (bottom panel); black and red curves represent reference and sensitivity cases, respectively


4 On-Going Work 

The importance of the Agulhas Leakage has been highlighted in various published
studies [1,4]. Over the last few decades, there has been a noteworthy increase in
the amount of Indian Ocean water leaking into the Atlantic [14,15]. The cause of
this increase has been and remains under debate owing to the intricate dynamics
resulting in leakage south of Africa. Among the external factors potentially
determining the magnitude of Agulhas Leakage on various timescales are the
Southern Hemisphere wind patterns, namely the Trade winds and the Westerlies.
On-going work seeks to address this complex issue by applying smooth anomalies
to the current wind pattern, effectively nudging the average winds to a different
state, with each anomaly applied systematically and in isolation (Fig. 5).
Monitoring changes in Agulhas Leakage while disentangling the individual
impacts of these two wind belts helps in understanding the underlying dynamics.


Numerical Investigation of Stratified Turbulence


S. Remmler and S. Hickel 


Abstract In order to evaluate turbulence subgrid-scale models for stably stratified 
flows, we performed direct numerical simulations (DNS) of homogeneous stratified 
turbulence with large-scale horizontal forcing. In these simulations we found that 
energy dissipation is concentrated within thin layers of horizontal tagliatelle-like 
vortex sheets between large pancake-like structures. For large eddy simulation 
(LES), we use an implicit subgrid-scale model, based on the Adaptive Local 
Deconvolution Method (ALDM). Our analysis proves that the implicit turbulence 
model ALDM correctly predicts the turbulence energy budget and the energy spectra 
of stratified turbulence, even though dissipative structures are not resolved on the 
computational grid. 


1 Introduction 

To predict atmospheric and oceanic mesoscale flows, we need to understand and 
parametrize small scale turbulence that is strongly affected by the presence of stable 
density stratification. The stratification suppresses vertical motions and thus makes 
all scales of the velocity field strongly anisotropic. Using aircraft observations, 
the horizontal velocity spectrum in the atmosphere was analyzed by Nastrom and 
Gage [23]. They found a power-law behavior in the mesoscale range with an
exponent of −5/3. In the vertical spectrum, on the other hand, Cot [4] observed
an exponent of −3 in the inertial range.

There has been a long and intensive discussion whether the observed spectra are
due to a backward cascade of energy [8, 10, 18] as in two-dimensional turbulence
[15], or due to breaking of internal waves, which means that a forward cascade is the
dominant process [5,32]. In different numerical and theoretical studies, ambiguous
or even conflicting results were obtained [19].

During the last decade, a number of new simulations and experiments addressed 
the issue. Smith and Waleffe [28] observed a concentration of energy in the lowest 
modes in their simulations. Other studies [16,33] suggested that the character of 
the flow depends on the Reynolds number. Apparently, high Reynolds numbers 
are associated with stronger three-dimensionality and a forward cascade of energy. 
Riley and de Bruyn Kops [25] suggested that the flow can be strongly stratified
but still turbulent if $Fr^2 Re > 1$. Lindborg [20] presented a scaling analysis of
the Boussinesq equations for low Froude and high Reynolds number. His theory
of strongly anisotropic, but still three-dimensional, turbulence explains the
horizontal $k_h^{-5/3}$ spectrum as well as the vertical $k_v^{-3}$ spectrum. On the basis of
these findings, Brethouwer et al. [2] showed that the relevant non-dimensional
parameter controlling stratified turbulence must indeed be the buoyancy Reynolds
number $\mathcal{R} = Fr^2 Re$. For $\mathcal{R} \gg 1$, they predict stratified turbulence including local
overturning and a forward energy cascade. In the opposite limit, for $\mathcal{R} \ll 1$,
the flow is controlled by viscosity and does not contain small-scale turbulent
motions. A detailed analysis of the spectral structure and spectral energy budget 
of homogeneous stratified turbulence based on direct numerical simulations is 
provided by Remmler and Hickel [24]. 
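
The buoyancy Reynolds number introduced above is simple to evaluate. The short Python sketch below is only an illustration (the function name is an assumption); it uses the Reynolds and Froude numbers reported later in this contribution for the strongly stratified DNS case (Re = 7,110, Fr = 0.017).

```python
def buoyancy_reynolds(froude, reynolds):
    """Buoyancy Reynolds number, the product Fr^2 * Re."""
    return froude ** 2 * reynolds


# values of the strongly stratified DNS case quoted in Sect. 4
r = buoyancy_reynolds(0.017, 7110)
print(round(r, 1))
# criterion of Riley and de Bruyn Kops: stratified but still turbulent if Fr^2 Re > 1
print(r > 1.0)
```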

Since a full resolution of all turbulence scales is only possible for very low 
Reynolds numbers, many groups performed large eddy simulations (LES), which 
rely on a subgrid-scale model to represent effects of unresolved small-scale 
turbulence. For example, Metais and Lesieur [22] used a spectral eddy viscosity 
model, based on the eddy damped quasi-normal Markovian (EDQNM) theory. This 
required a flow simulation in Fourier space and the cut-off wavenumber to be in 
the inertial range. For LES in physical space, Smagorinsky models are widely used, 
either in the classical formulation [14] or with certain modifications for stratified 
turbulence that are usually based on the local Richardson number [6]. Better results 
can be obtained if the model constant of a Smagorinsky model is not prescribed, 
but computed by the dynamic procedure of Germano et al. [9]. This approach was 
successfully applied to stably stratified turbulent channel flow by Taylor et al. [30] 
and others. Staquet and Godeferd [29] presented a two-point closure statistical 
EDQNM turbulence model, which was adapted for axisymmetric spectra about 
the vertical axis. Recently, many groups presented regularized direct numerical
simulations (DNS) of stratified turbulence, i.e. under-resolved DNS that are
stabilized rather pragmatically by removing the smallest resolved scales. This is
usually achieved by a hyperviscosity approach [20] or by de-aliasing in spectral 
methods using the “2/3-rule” [1,7]. 

In practice, all subgrid-scale (SGS) turbulence models suffer from the problem that
the computed SGS stresses are of the same order as the grid truncation error. This
typically leads to interference between SGS model and numerical scheme, which can
manifest itself as instability and lack of grid convergence. This issue can be
solved by combining discretization scheme and SGS model in a single approach. This
is usually referred to as “implicit” LES (ILES) in contrast to the traditional “explicit” SGS models.
The idea of physically consistent ILES was realized by Hickel et al. [11] in 
the Adaptive Local Deconvolution Method (ALDM) for neutrally stratified fluids. 
ALDM provides a framework for the design, analysis and optimization of numerical 
discretizations with an implicit SGS model that is consistent with turbulence theory. 
Based on this method and ALDM for passive scalar transport [12], we developed 
an implicit SGS model for Boussinesq fluids. In the present paper, we evaluate the 
applicability of ALDM for stably stratified turbulence. 

We simulated forced homogeneous stratified turbulence in a triple-periodic box 
at different Froude and Reynolds numbers. We present not only ILES, but also 
LES with a standard Smagorinsky model (SSM) and a dynamic Smagorinsky model 
(DSM) as well as high-resolution DNS as benchmark solutions. 


2 Governing Equations 

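The non-dimensional Boussinesq equations for a stably stratified fluid in Cartesian
coordinates read

$$\nabla \cdot \mathbf{u} = 0 \tag{1a}$$

$$\partial_t \mathbf{u} + \nabla \cdot (\mathbf{u}\mathbf{u}) = -\nabla p - \frac{\rho}{Fr_0^2}\,\mathbf{e}_z + \frac{1}{Re_0}\nabla^2 \mathbf{u} \tag{1b}$$

$$\partial_t \rho + \nabla \cdot (\rho \mathbf{u}) = -w + \frac{1}{Pr\,Re_0}\nabla^2 \rho \tag{1c}$$

where velocities $\mathbf{u} = [u, v, w]$ are made non-dimensional by $\mathcal{U}$, all spatial
coordinates by the length scale $\mathcal{L}$, pressure by $\mathcal{U}^2$, time by $\mathcal{L}/\mathcal{U}$, and density fluctuation
$\rho = \rho^* - \bar{\rho}$ ($\rho^*$: local absolute density, $\bar{\rho}$: background density) by the background
density gradient $\mathcal{L}\,|d\bar{\rho}/dz|$. The non-dimensional parameters are

$$Fr_0 = \frac{\mathcal{U}}{N \mathcal{L}}, \quad Re_0 = \frac{\mathcal{U}\mathcal{L}}{\nu}, \quad Pr = \frac{\nu}{\alpha}, \tag{2}$$

where $N = \sqrt{(g/\rho_0)\,|d\bar{\rho}/dz|}$ is the Brunt-Väisälä frequency, $\nu$ is the kinematic
viscosity and $\alpha$ is the thermal diffusivity. We chose a Prandtl number of Pr = 0.7,
corresponding to values found in the atmosphere. Froude and Reynolds number are
the parameters that control the flow regime.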

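With the instantaneous values of kinetic energy $E_k$ and kinetic energy dissipation
$\varepsilon_k$, we find the local Froude and Reynolds number as well as the buoyancy Reynolds
number $\mathcal{R}$, defined by Brethouwer et al. [2]:

$$Fr = Fr_0 \frac{\mathcal{L}\,\varepsilon_k}{\mathcal{U}\,E_k}, \quad Re = Re_0 \frac{E_k^2}{\mathcal{U}\mathcal{L}\,\varepsilon_k}, \quad \mathcal{R} = Re\,Fr^2 \tag{3}$$
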
In ILES, by construction, we do not have direct access to the value of $\varepsilon_k$, as only
a small part of it is resolved. For LES, we thus estimate $\varepsilon_k$ from the total energy
balance

$$\partial_t \langle E_t \rangle = \partial_t \langle E_k \rangle + \partial_t \langle E_p \rangle = P - \varepsilon_k - \varepsilon_p = P - (1 + \Gamma)\,\varepsilon_k, \tag{4}$$

where the temporal change of total energy in the flow $\partial_t \langle E_t \rangle$ can be computed from
the energy levels at subsequent time steps, $P$ is the power inserted into the system
by the external forcing and $\Gamma = \varepsilon_p/\varepsilon_k$ is the mixing ratio, assumed to be constant at
$\Gamma = 0.4$, which is an acceptable approximation for a wide range of parameters [24].
Here, $\varepsilon_p$ is the dissipation rate of potential energy. The kinetic energy dissipation
rate can then be computed from

$$\varepsilon_k = \frac{P - \partial_t \langle E_t \rangle}{1 + \Gamma}. \tag{5}$$
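
The estimate of the kinetic energy dissipation from the total energy balance can be sketched in a few lines of Python. This is an illustration only, not code from the solver: the function name, the argument names and the first-order finite difference used for the time derivative are assumptions, while the constant mixing ratio of 0.4 is the value stated in the text.

```python
def estimate_kinetic_dissipation(e_total_prev, e_total_now, dt, power,
                                 mixing_ratio=0.4):
    """Estimate the kinetic energy dissipation from the total energy balance.

    The balance d<E_t>/dt = P - (1 + mixing_ratio) * eps_k is solved for
    eps_k; the time derivative is approximated from the total energy at two
    subsequent time steps (assumption: first-order finite difference).
    """
    d_e_dt = (e_total_now - e_total_prev) / dt
    return (power - d_e_dt) / (1.0 + mixing_ratio)


# statistically steady state: total energy constant, forcing power 1.4
print(estimate_kinetic_dissipation(2.0, 2.0, 0.1, 1.4))
```

In a statistically steady state the time derivative vanishes, and the estimate reduces to the forcing power divided by one plus the mixing ratio.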


3 Numerical Method 
3.1 Flow Solver 

Our finite-volume solver INCA offers different discretization schemes depending on 
the application. For DNS and LES with SSM and DSM, we used a non-dissipative 
central difference scheme with fourth order accuracy for the convective terms and 
second order central differences for the diffusive terms and the continuity equation 
(Poisson equation for pressure). For implicit LES, we replaced the central difference 
scheme for the convective terms by the implicit turbulence model ALDM. All 
computations were run on Cartesian staggered grids with uniform cell size. 

For time integration, we used an explicit third-order accurate Runge-Kutta 
scheme, proposed by Shu [26]. The time step was dynamically adjusted to keep 
the CFL number smaller than unity. 
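
The time integration can be illustrated by the following Python sketch. It assumes that the scheme cited above is the widely used three-stage, third-order strong-stability-preserving Runge-Kutta method; the text only cites Shu [26], so the stage coefficients, the function names and the CFL target of 0.5 are assumptions made for illustration.

```python
import math


def rk3_step(u, dt, rhs):
    """One step of the three-stage, third-order SSP Runge-Kutta scheme.

    Assumption: this is the variant meant in the text; `rhs(u)` returns
    the time derivative of u.
    """
    u1 = u + dt * rhs(u)
    u2 = 0.75 * u + 0.25 * (u1 + dt * rhs(u1))
    return u / 3.0 + 2.0 / 3.0 * (u2 + dt * rhs(u2))


def cfl_time_step(u_max, dx, cfl=0.5):
    """Time step for which the CFL number u_max * dt / dx equals `cfl`."""
    return cfl * dx / u_max


# scalar test problem du/dt = -u with exact solution exp(-t)
u, dt = 1.0, 0.01
for _ in range(100):
    u = rk3_step(u, dt, lambda v: -v)
print(u, math.exp(-1.0))
```

In a flow solver, `cfl_time_step` would be re-evaluated in every step from the current maximum velocity, which corresponds to the dynamic adjustment mentioned above.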

The Poisson equation for the pressure is solved in every Runge-Kutta substep.
The Poisson solver employs a fast Fourier transform (FFT) in the vertical direction
and a Stabilized Bi-Conjugate Gradient (BiCGSTAB) solver [31] in the horizontal 
plane. By the FFT, the three-dimensional problem is transformed into a set of 
independent two-dimensional problems, which can be solved in parallel. 

Our code INCA is parallelized both for distributed and shared memory usage. 
Additionally, it is optimized for running on vector computer systems. For the 
computations presented here, we use the single domain shared memory approach 
for efficient computation of Fourier transforms of the whole data set. On this single 
domain, we use the OpenMP shared memory parallelization capabilities of INCA.
The limiting factor for the single-domain approach is the size of the shared memory
of the computer. We found excellent conditions on the NEC SX-9 vector computer
at the High Performance Computing Center Stuttgart (HLRS). One node of the
SX-9 provides 510 GB shared memory for 16 vector CPUs. This enabled us to run
DNS with up to approximately one billion cells. The computation time was about
70 ns per time step and cell. The computational performance of the pressure Poisson
solver reached approximately 19 GFLOP/s. The computation of the numerical flux
function reached 35 GFLOP/s and the reconstruction routine 49 GFLOP/s, which is
half the nominal peak performance of the SX-9.
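
To give a feeling for the quoted cost of 70 ns per time step and cell, the following Python sketch converts it into wall-clock time per time step. It is a rough illustration that assumes the cost per cell is independent of the grid size; the function name is hypothetical.

```python
def seconds_per_time_step(n_cells, ns_per_cell=70.0):
    """Wall-clock seconds for one time step at `ns_per_cell` nanoseconds per cell."""
    return n_cells * ns_per_cell * 1.0e-9


# grids of the two DNS series (320^3 and 640^3 cells) and the largest
# grid size mentioned in the text (about one billion cells)
for n_cells in (320 ** 3, 640 ** 3, 1.0e9):
    print(n_cells, seconds_per_time_step(n_cells))
```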


3.2 The Adaptive Local Deconvolution Method 

Our approach to ILES is based on a nonlinear finite-volume scheme involving
a solution-adaptive reconstruction (deconvolution) of the numerical solution. The
nonlinear discretization scheme generates a certain controllable spectral numerical 
viscosity. Using an evolutionary optimization algorithm, free parameters of the 
discretization scheme have been calibrated in such a way that the effective spectral 
numerical viscosity is identical to the spectral eddy viscosity from turbulence theory 
for asymptotic cases [11]. For a detailed description of ALDM, we refer to Hickel 
et al. [11]. The extension of ALDM to stably stratified media is provided by 
Remmler and Hickel [24]. 


3.3 Standard Smagorinsky Model 

If we apply a generic spatial filter (denoted by an overbar) to the dimensional
momentum equation, we obtain

$$\partial_t \bar{u}_i + \partial_{x_j}(\bar{u}_i \bar{u}_j) + \partial_{x_i}\bar{p} = \partial_{x_j}(2\nu \bar{S}_{ij}) - \partial_{x_j}\tau_{ij}, \tag{6}$$

where $\bar{S}_{ij} = 0.5\,(\partial_{x_i}\bar{u}_j + \partial_{x_j}\bar{u}_i)$ is the filtered strain rate tensor and
$\tau_{ij} = \overline{u_i u_j} - \bar{u}_i \bar{u}_j$ is the unknown SGS stress tensor, which has to be modeled. With a
Boussinesq approach the SGS stress tensor is modeled as

$$\tau_{ij}^{mod} = -2\nu_t \bar{S}_{ij}. \tag{7}$$

The eddy viscosity concept is common to many SGS turbulence models.
Smagorinsky [27] estimated the unknown eddy viscosity $\nu_t$ from

$$\nu_t = (C_S \bar{\Delta})^2\,|\bar{S}|; \quad |\bar{S}| = \sqrt{2 \bar{S}_{ij} \bar{S}_{ij}}, \tag{8}$$

where $\bar{\Delta} = (\Delta_x \Delta_y \Delta_z)^{1/3}$ is the grid or filter size, respectively. In this
formulation, the unknown SGS fluxes can be computed directly from the resolved
velocity field. There is no universal value for the model constant $C_S$; for different
flow configurations different values of the constant have been found to be optimal.
In our simulations we use a value of $C_S = 0.18$, which follows from theory of
isotropic turbulence [17] and has been found to yield good results in practice [3].
The buoyancy equation is closed analogously by an eddy diffusivity model with
$\alpha_t = \nu_t / Pr_t$.
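
The Smagorinsky eddy viscosity, i.e. the square of the product of model constant and filter size times the norm of the resolved strain rate, can be evaluated for a single cell as in the following Python sketch. It is an illustration only: the function names and the layout of the velocity gradient (`grad_u[i][j]` holding the derivative of velocity component i in direction j) are assumptions, and the cube-root definition of the filter size is the common choice assumed here.

```python
import math


def strain_rate(grad_u):
    """Symmetric part of a 3x3 velocity gradient, grad_u[i][j] = du_i/dx_j."""
    return [[0.5 * (grad_u[i][j] + grad_u[j][i]) for j in range(3)]
            for i in range(3)]


def smagorinsky_viscosity(grad_u, dx, dy, dz, c_s=0.18):
    """Eddy viscosity (c_s * delta)^2 * |S| with |S| = sqrt(2 S_ij S_ij)."""
    s = strain_rate(grad_u)
    s_norm = math.sqrt(2.0 * sum(s[i][j] ** 2
                                 for i in range(3) for j in range(3)))
    delta = (dx * dy * dz) ** (1.0 / 3.0)  # assumed filter size definition
    return (c_s * delta) ** 2 * s_norm


# pure shear du/dy = 1 on a uniform grid with cell size 0.1
shear = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(smagorinsky_viscosity(shear, 0.1, 0.1, 0.1))
```

The eddy diffusivity of the buoyancy equation would follow by dividing the result by a turbulent Prandtl number.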


3.4 Dynamic Smagorinsky Model 


The case-dependence of the value for the model constant in the standard Smagorinsky
model led to the idea of replacing the constant by a dynamic parameter, which
automatically adjusts to the flow conditions. Germano et al. [9] presented a general 
dynamic procedure for eddy viscosity models and applied it to the Smagorinsky 
model. The basic idea is a similarity between the interactions of the smallest 
resolved scales and unresolved scales compared to the interactions between medium 
scales and the smallest resolved scales. 

The solution is available in its filtered form $\bar{u}$ with a grid filter width $\bar{\Delta}$. This
filtered velocity field is explicitly filtered by a test filter with a larger filter width
$\hat{\Delta}$. As a test filter, we use a top-hat filter with $\hat{\Delta} = 2\bar{\Delta}$. The subfilter-scale stress
tensor is $T_{ij} = \widehat{\overline{u_i u_j}} - \hat{\bar{u}}_i \hat{\bar{u}}_j$. It cannot be computed directly from the filtered velocity
field, but one can compute the Leonard stress tensor $L_{ij} = \widehat{\bar{u}_i \bar{u}_j} - \hat{\bar{u}}_i \hat{\bar{u}}_j$. Using the
Germano identity

$$T_{ij} = L_{ij} + \hat{\tau}_{ij} \tag{9}$$

and the standard Smagorinsky model for $\tau_{ij}$ and $T_{ij}$, we can minimize the difference
between $L_{ij}$ and

$$L_{ij}^{mod} = T_{ij}^{mod}(C, \hat{\Delta}, \hat{\bar{u}}) - \widehat{\tau_{ij}^{mod}(C, \bar{\Delta}, \bar{u})} = -2C\hat{\Delta}^2\,|\hat{\bar{S}}|\,\hat{\bar{S}}_{ij} + 2C\,\widehat{\bar{\Delta}^2 |\bar{S}| \bar{S}_{ij}} = 2 C M_{ij} \tag{10}$$

by a least-squares procedure

$$C = \frac{1}{2} \frac{L_{ij} M_{ij}}{M_{ij} M_{ij}}. \tag{11}$$

We apply this dynamic procedure in every time step to obtain the model parameter
$C_S^2 = C$ for the flow. Since the presently investigated flows are homogeneous in
all spatial directions, numerator and denominator of Eq. (11) are averaged in space
before the quotient is evaluated. So finally, the parameter $C_S$ is spatially constant, but can
vary in time.
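
The least-squares evaluation of the dynamic coefficient with spatial averaging can be sketched in Python as follows. The sketch assumes that the Leonard stress tensor and the model tensor of Eq. (10) have already been computed for every cell; the test filtering itself is omitted, and all names are hypothetical.

```python
def contract(a, b):
    """Double contraction a_ij * b_ij of two 3x3 tensors."""
    return sum(a[i][j] * b[i][j] for i in range(3) for j in range(3))


def dynamic_coefficient(leonard, model):
    """Least-squares coefficient C = 0.5 * <L_ij M_ij> / <M_ij M_ij>.

    `leonard` and `model` are lists with one 3x3 tensor per cell. Numerator
    and denominator are summed over all cells before the division, which is
    equivalent to averaging them in space.
    """
    numerator = sum(contract(l, m) for l, m in zip(leonard, model))
    denominator = sum(contract(m, m) for m in model)
    return 0.5 * numerator / denominator


# synthetic check: if L = 2 C M holds exactly, C is recovered
c_true = 0.18 ** 2
m_cells = [
    [[1.0, 2.0, 0.0], [2.0, -1.0, 0.5], [0.0, 0.5, 0.0]],
    [[0.3, 0.0, 1.0], [0.0, 0.2, 0.0], [1.0, 0.0, -0.5]],
]
l_cells = [[[2.0 * c_true * x for x in row] for row in t] for t in m_cells]
print(dynamic_coefficient(l_cells, m_cells))
```

The Smagorinsky constant follows as the square root of the coefficient, provided the coefficient is positive.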

Fig. 1 Regime diagram for our simulations of stratified turbulence


4 Results and Discussion 

We investigated homogeneous stratified turbulence in a statistically steady state. 
The flow is maintained at an approximately constant energy level by a large scale 
vertically uniform forcing of the horizontal velocity components. This approach 
models the forcing by the synoptic scale flow in the atmosphere and was success¬ 
fully applied by several authors before [20,21, 33]. 

We ran two series of DNS, series A with $Re_0$ = 6,500 and series B with
$Re_0$ = 13,000. The domain size was $320^3$ cells for series A and $640^3$ cells for
series B. Within each series, the Froude number was varied to cover different
buoyancy Reynolds numbers. The basic domain size again was $2\pi\mathcal{L}$. For low
Froude numbers, we used a flat domain with a height of only $\pi\mathcal{L}$, but keeping
cubical cells. This is permitted since in stratified turbulence there is only a very
small amount of energy contained in the large scale vertical modes.

For both series, we performed LES, both with implicit ALDM and explicit SSM 
and DSM. For all these simulations, we used grid boxes with 64 3 cells. For the low 
Froude number simulations, the domain was flattened as well, leading to a doubled 
resolution in vertical direction. Figure 1 shows the local Froude and Reynolds 
number of the simulations. 

Most important for the assessment of a parametrization scheme for stratified 
turbulence is its ability to correctly predict the amount of energy converted from 
horizontal kinetic energy to vertical kinetic energy and available potential energy 
before the energy is finally dissipated on the smallest represented scales. In Fig. 2, 
we show the ratio $E_v/E_p$ as a function of local Froude number as predicted by
DNS and LES with ALDM. The ratio $E_v/E_p$ is not influenced by the forcing and
can thus freely develop according to the dynamic interaction of convective, pressure 
and buoyancy term. 

Fig. 2 Ratio of vertical to potential energy in HST as a function of local Froude number


The vertical to potential energy ratio increases almost linearly with Froude 
number in the DNS. We find the same trend in our LES with ALDM. The agreement 
between DNS and LES is best in the region of high Froude numbers (weakly 
stratified turbulence), whereas for low Froude numbers the vertical kinetic energy 
is slightly underpredicted. Note that the results from ALDM and SSM differ from 
each other most at the lowest Froude number. This indicates that ALDM is
better able to handle the strong turbulence anisotropy in strongly stratified
flows.

For a visual comparison of neutrally and stably stratified turbulence, we show 
the results of two computations from series B in Figs. 3 and 4. Both show snapshots 
of the developed turbulent flow at comparable Reynolds numbers. The turbulent 
structures are visualized by iso-surfaces of the Q-criterion [13]. For presentation 
purposes, the visualization includes only slices of finite thickness at the domain 
boundaries. 

Figure 3 shows a case with neutral stratification (Re = 8,440). For better 
comparability with the stratified case, we show only the lower half of the cubical 
computational domain. The visualization shows that the turbulent structures have 
no preferred orientation. This proves that isotropic turbulence can be generated 
by the horizontal large-scale forcing that we used. Furthermore, we observe a 
spatially intermittent field of turbulence which reflects the remaining large-scale 
anisotropy. There are regions of higher and lower density of turbulent vortices. The 
regions with strong turbulence activity are associated with much higher values of 
molecular dissipation rate. Note that the color scale of normalized dissipation rate 
is logarithmic. In the (rare) red regions, the dissipation rate is more than 30 times 
higher than the instantaneous spatial average in the whole domain. 

In case of strongly stratified turbulence (Re = 7,110, Fr = 0.017, $\mathcal{R}$ = 2.1, 
cf. Fig. 4), the turbulence structures look completely different. Although the Reynolds 
number and the mean total energy dissipation are similar to the neutrally stratified case, 
the smallest eddies are much larger. Additionally, all eddies are aligned horizontally. 
Despite the larger vortices, the molecular dissipation rates are comparable to the 
neutral case. This is due to the intensified shear between the horizontal layers. 

Fig. 3 DNS of neutrally stratified turbulence (Re = 8,440). Iso-surface of Q, color of 
iso-surface and shading of domain faces by local molecular dissipation rate (normalized 
by the spatial average) 

Fig. 4 DNS of strongly stratified turbulence (Re = 7,110, Fr = 0.017). Coloring as in Fig. 3 

We present vertical and horizontal cuts of the DNS domain in Figs. 5 and 6. In 
the neutrally stratified case, we can still see the signature of the horizontal forcing 
in the velocity magnitude plots (left panels of Fig. 5). There are large-scale column-like 
regions of higher or lower velocity, superposed by a lot of small-scale variation. 
Apparently, this does not affect the behavior of the small scales. The signature 
of the column structures is missing in the contour plot of molecular dissipation 
(right panels). There is hardly any correlation between velocity magnitude and 
molecular dissipation. Small turbulence scales “forget” about the anisotropic large-scale 
forcing. 

Fig. 5 Vertical (x = 0) and horizontal (z = 2.4) cuts through a DNS result of neutrally stratified 
turbulence. Left panels: velocity magnitude, right panels: molecular dissipation 

In case of strong stratification (Fig. 6), we find a strong horizontal layering, as 
described in many studies before. The flow is separated into thin layers of very 
different kinetic energy. The flow laminarization due to stratification reduces the 
amount of small-scale variations of velocity in horizontal planes compared to the 
neutrally stratified case. Neighboring layers can have strongly different (horizontal) 
velocities, which results in the formation of unstable shear layers. We observe a 
lot of Kelvin-Helmholtz-like structures all over the domain (cf. upper right panel of 
Fig. 6), which are responsible for the major part of energy dissipation. The signature 
of these Kelvin-Helmholtz layers can also be observed in the horizontal cut of the 
domain. 

Many authors refer to the structures in stratified turbulence as “pancake” vortices. 
This popular image is used to describe large flat vortices rotating around a vertically 
oriented core. We observe large flat structures in the flow, but they are far from 
being coherent quasi-two-dimensional vortices. Instead, we find only very weak 
rotation around the vertical axis. The dominant turbulence structures are small-scale 
Kelvin-Helmholtz vortices. These vortices dominate the energy dissipation in 
the stratified turbulent flow. They appear in horizontal patches between two layers of 
strongly differing velocities, but their axis of rotation is basically horizontal. To stay 
within the culinary images, these vortices may better be described as “tagliatelle” 
rather than pancakes. 

Fig. 6 Vertical (x = 0) and horizontal (z = 0.1) cuts through a DNS result of strongly stratified 
turbulence. Left panels: velocity magnitude, right panels: molecular dissipation 

In Fig. 7 we present the contours of velocity magnitude for ALDM computations 
of the two cases presented before. The LES resolution was $64^3$, so the total number 
of cells was three orders of magnitude lower than in the DNS. The global flow 
structure closely resembles the DNS result, of course without the unresolved small 
scale content. In the strongly stratified case, we observe the same layering with large 
horizontal scales as in the DNS. Even the number and thickness of horizontal layers 
is similar. The major difference between LES and DNS consists in the horizontal 
Kelvin-Helmholtz vortices. Their vertical extension, as found in the DNS, is of the 
same order of magnitude as the vertical resolution in the LES. Hence, they form 
a typical subgrid-scale problem and their dissipative effect is accounted for by the 
implicit turbulence model. 

For comparison of kinetic energy spectra, we selected one DNS in the weakly 
stratified regime ($\mathcal{R}$ = 41, Fr = 0.07, Re = 9,300; cf. Fig. 8) and one DNS in the 
strongly stratified regime ($\mathcal{R}$ = 6.3, Fr = 0.03, Re = 8,300; cf. Fig. 9), both from 
series B. A more detailed analysis of these computations in spectral space is 
provided by Remmler and Hickel [24]. The corresponding LES have similar local 
Froude and Reynolds numbers. 

Fig. 7 Vertical (x = 0) and horizontal (left panel: z = 1.6, right panel: z = 0.9) cuts through 
ALDM results (velocity magnitude) of neutrally and strongly stratified turbulence 

Fig. 8 Weakly stratified turbulence kinetic energy spectra ($\mathcal{R}$ = 41). (a) Horizontal spectra, 
(b) Vertical spectra 

Fig. 9 Strongly stratified turbulence kinetic energy spectra ($\mathcal{R}$ = 6.3). (a) Horizontal spectra, 
(b) Vertical spectra 

In the horizontal spectra of kinetic energy, the differences between ALDM and 
the explicit SGS models are most obvious. In the weakly stratified case ($\mathcal{R}$ = 41), 
the horizontal spectrum is still quite similar to the Kolmogorov spectrum of isotropic 
turbulence. In this case, all three SGS models predict the inertial range spectrum 
fairly well. The SSM and DSM are slightly too dissipative, but the difference to the 
DNS spectrum is acceptable. Things change completely for the more strongly stratified 
case ($\mathcal{R}$ = 6.3). The SSM dissipates too much energy and thus underpredicts the 
inertial range spectrum by more than one order of magnitude. Additionally, the 
predicted power-law exponent is significantly lower than $-5/3$. With the DSM, 
the overall prediction of the spectrum is much better than with the SSM, but 
the power-law exponent near the cut-off wavenumber is greater than $-5/3$. The 
spectrum predicted by ALDM agrees better with the DNS than the results of both 
Smagorinsky models. Only ALDM correctly predicts the characteristic plateau 
region between the forcing scales and the inertial scales. Moreover, ALDM produces 
a power-law decay with an exponent of $-5/3$, consistent with the DNS and with theory 
derived from scaling laws [2]. 
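The comparison of power-law exponents above can be made concrete with a small sketch. It estimates the exponent as the least-squares slope of log E against log k. The function name and the synthetic spectrum are illustrative assumptions; the data are generated from an ideal $k^{-5/3}$ law and are not simulation output.

```python
import math

def fit_power_law_exponent(wavenumbers, energies):
    """Least-squares slope of log(E) against log(k)."""
    xs = [math.log(k) for k in wavenumbers]
    ys = [math.log(e) for e in energies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic inertial-range samples following E(k) = C * k^(-5/3).
# These values are an assumption for illustration, not DNS or LES data.
k_values = [4.0, 6.0, 8.0, 12.0, 16.0, 24.0]
e_values = [1.5 * k ** (-5.0 / 3.0) for k in k_values]

slope = fit_power_law_exponent(k_values, e_values)
print(f"fitted exponent: {slope:.4f}")
```

In practice the fit would be restricted to the wavenumber band between the forcing scales and the cut-off, since the plateau region and the dissipative range would otherwise bias the slope.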

We note that the dynamic model coefficient in the DSM is approximately 
constant in time as soon as the statistically steady state is reached. The dynamic 
procedure is efficient in choosing a proper coefficient depending on the strength of 
the stratification, but cannot cure the more structural weakness of the isotropic eddy 
viscosity approach. 

In the vertical spectra of kinetic energy, the inertial range decay exponent changes 
from $-5/3$ in neutrally stratified fluid to $-3$ in strongly stratified turbulence. We find 
this change in the DNS and it is well reproduced by the LES. All three SGS models 
predict the turbulence inertial range decay well. At strong stratification, the ALDM 
result agrees perfectly with the DNS. The SSM result is slightly too dissipative in 
this region. 


5 Conclusion 

We presented a numerical investigation of homogeneous turbulence in a stably 
stratified fluid to demonstrate the reliability of implicit turbulence modeling with the 
Adaptive Local Deconvolution Method (ALDM). As benchmark results, we used 
high resolution DNS data and LES results obtained with an explicit standard 
Smagorinsky model (SSM) and a dynamic Smagorinsky model (DSM). In most 
simulations, the buoyancy Reynolds number was larger than unity. The Froude 
and Reynolds numbers were chosen to cover the complete range from isotropic 
Kolmogorov turbulence up to strongly stratified turbulence. 

We analyzed results from DNS of homogeneous turbulence with and without 
stable stratification. As in previous studies, we found a strong horizontal layering 
in the strongly stratified cases. Energy dissipation is concentrated within thin layers 
of Kelvin-Helmholtz vortices. Although these dominant vortices are not resolved in 
the LES, the LES results (with ALDM) agree well with the reference DNS, both in 
integral flow properties and energy spectra. This applies to the whole Froude number 
range from infinity down to very low values. Especially in the strongly stratified 
regime, ALDM performs better than the SSM. While the SSM is far too dissipative 
in this case, ALDM spectra agree very well with the reference DNS. With the DSM, 
the excessive dissipation of the SSM can be avoided, but the spectral slope near the 
cut-off wavenumber is still not correct. Among the investigated SGS models, only 
ALDM predicted the correct exponent of $-5/3$. 

The results presented here were obtained without recalibrating the ALDM model 
constants for stratified turbulence. The good agreement with DNS data shows 
the ability of ALDM to automatically adapt to strongly anisotropic turbulence. 
Within the continuation of the project, we will investigate to which extent the 
results can be further improved by a recalibration of the model coefficients for 
stratified turbulence. But even without these possible improvements, ALDM provides 
a suitable parametrization of geophysical turbulence. After having applied ALDM 
successfully to rather simple model flows, we will continue working on more 
complex geophysical problems. 


Part VI 

Miscellaneous Topics 


Univ.-Prof. Dr.-Ing. Wolfgang Schröder, Prof. Dr.-Ing. Peter Wriggers, 
and Univ.-Prof. Dr. Hans-Joachim Bungartz 


The following contributions will show that numerical simulations represent useful 
tools to gain novel results and as such to improve the scientific knowledge not only 
in physics and/or solid and fluid mechanics. The papers evidence the close link 
between applied and fundamental research. In other words, the close relationship 
between mathematics, computer science and the ability to develop scientific models 
is clearly shown. Compact mathematical descriptions will be solved by highly 
sophisticated and efficient algorithms on high-end computers. This interdisciplinary 
collaboration between several scientific fields defines the extremely intricate 
numerical challenges and as such drives the progress in fundamental and applied 
research. The subsequent articles represent a selection from the large number of projects 
linked with HLRS. The computations are not only used to obtain some 
quantitative results but to confirm fundamental physical models and to even derive 
new theoretical approaches. Nevertheless, it has to be emphasized that it will always 
make sense to substantiate numerical simulations by experimental investigations and 
analytical solutions. 

In the first contribution of the Goethe University Frankfurt a software framework 
is discussed. The modeling of physical phenomena in a variety of fields of 
scientific interest leads to a formulation in terms of partial differential equations. 
Especially when complex geometries as the domain of definition are involved, a 
direct and exact solution is not accessible, so numerical schemes have to be 
used to compute an approximate discrete solution. The focus is on elliptic and 
parabolic equations that include spatial operators of second order. When discretizing 
such problems using commonly known discretization schemes like finite element 
methods or finite volume methods, large systems of linear equations arise naturally. 
Their solution takes the largest amount of the overall computing time. A powerful 
algorithm for the solution of systems of linear equations is the multi-grid method. 
The discussion deals with the scalability of several multi-grid approaches to show 
that the presented methods are well-suited to be used on parallel supercomputers. 
All methods are included in the simulation framework called ug4. This framework 
is designed for the solution of differential equations on unstructured hybrid grids in 
one, two, and three space dimensions on systems ranging from laptops to massively 
parallel computers. 

In the contribution of the University of Paderborn the capability of molecular 
simulation to determine fluid phase equilibria is examined, especially with respect 
to the problems of the standard approach in which the data for the design of 
such thermodynamic processes are aggregated by empirical correlations, which 
are based on experimental data. Molecular modeling and simulation is a modern 
route for the prediction of thermodynamic properties. Being based on mathematical 
representations of the intermolecular interactions, it has strong predictive capabilities 
as it adequately covers the structure, energetics, and dynamics on the microscopic 
scale that govern the fluid behavior on the macroscopic scale. The capability of 
previously discussed models to predict the liquid-liquid equilibrium (LLE) data 
is studied for the binary mixture of nitrogen and ethane. The decomposition of 
a randomly distributed mixture of these two components in their LLE phases 
is investigated. Subsequently, the composition of the phases in equilibrium is 
compared to experimental data. It is shown that vapor-liquid equilibrium (VLE) and 
LLE can be predicted by molecular simulation with good agreement to experiments. 

The next article of the German Aerospace Center, Berlin and the University of 
Münster discusses a particle-in-cell method to simulate the impact of partial melt on 
mantle convection. Solid-state convection is the principal mechanism that controls 
the global dynamics and thermal evolution of the terrestrial planets. Observations 
such as seismology and mission data from geological structures at the planetary 
surfaces provide important constraints for the interior dynamics. However, the main 
knowledge stems from laboratory experiments and in particular from computer 
models. In recent years, due to significant improvements in high performance computing, 
computer simulations have offered powerful access to this fluid-dynamical 
problem by approximately solving partial differential equations to describe the flow 
in space and time. Results will be presented from the cylindrical/spherical code 
GAIA which is based on a particle-in-cell method to account for compositional 
changes due to partial melting of the mantle. 

The fourth contribution of the Friedrich-Schiller University, Jena also deals 
with mantle convection. Some essential features of Andean orogenesis cannot be 
explained only by a dynamic regional model since there are essential influences 
across its vertical boundaries. A dynamic regional model of the Andes should be 
embedded in a 3-D spherical-shell model. Because of the energy distribution on 
the poloidal and toroidal parts of the creep velocity and because of geologically 
determined mass transport alongside the Andes, both models have to be three-dimensional. 
A new viscosity profile of the mantle with very steep gradients at the 
lithospheric-asthenospheric boundary was developed. Based on the new viscosity 
profile a new forward spherical-shell model was introduced. For this model a new 
extended acoustic Grüneisen parameter, $\gamma_{ax}$, new profiles of the thermal expansivity, 
$\alpha$, and of the specific heat at constant volume, $c_v$, as well as a solidus depending 
on both the pressure and the water abundance were derived. A regional model 
of the Andean orogenesis with the same new viscosity profile was introduced. In 
connection with another spherical-shell model the regional model is supposed to 
numerically explain why a plateau-type orogen evolved at an oceanic-continental 
plate boundary. 

In the project of the University of Stuttgart the gravity gradiometry data with 
near global coverage from a single source, namely the satellite mission GOCE 
(Gravity field and steady-state Ocean Circulation Explorer) launched on 17 March 
2009, are investigated. The question to be addressed is whether geodesy can benefit 
from Euler deconvolution, e.g., to retrieve global gravity models, or is able to 
contribute its methodologies to enhance Euler deconvolution. So far, the project 
is still in a preparatory stage, mainly because the GOCE gradiometry data need to be 
preprocessed. 

In the article from the University of Hohenheim in Stuttgart economic capital 
allocation in banking is considered. The model was described as a mixed integer 
nonlinear program (MINLP) and an appropriate solving algorithm in the form of threshold 
accepting was introduced. In the current contribution the parameterization of the 
algorithm is addressed. The implementation of threshold accepting requires certain 
model modifications which also affect the parameterization procedure. Using an 
adequate parameterization, the model indicates that optimum economic 
capital allocation is superior to alternative allocation methods. 

Structural mechanics and material mechanics are areas that include demanding 
problems both from the theoretical side and from the computational point of view. 
On the one hand, based on the computing power, physical models can be refined and 
enhanced such that a virtual testing of complex behaviour is possible on different 
length and time scales. On the other hand, algorithms become more and more 
complex and need to be stabilized in order to obtain robust and reliable results. 
Challenges lie in the development of new computational methodologies and the 
reproduction of correct physical behaviour. 

The two contributions that relate to structural mechanics are concerned with the 
virtual testing of heterogeneous materials including damage effects using multiscale 
approaches. The work by Schrader and Könke provides insight into the behavior of 
different parallel computer architectures when applied to large scale problems. The 
implementation of solution algorithms was investigated by comparing algorithms 
for a linear solve within a nonlinear solution algorithm on the hybrid CPU-GPU 
NEC Nehalem cluster and the CRAY XE6 system. While this contribution 
was concerned with linear solvers, the second contribution of Eck et al. devoted 
its attention to numerical sensitivities in complex impact problems, e.g. crash 
simulations. The question was how these nonlinear simulations can be done in a 
robust manner, meaning that the results have to be reliable and reproducible for the 
same mesh in different runs when a large number of unknowns is present. Different 
solutions are caused either by physical or by numerical instabilities. The aim of the 
project was to reduce scatter of the responses by identification of the sources of 
scatter through parameter studies and to improve the robustness by an enhancement 
of existing mathematical formulations and algorithms. This is especially needed 
when these simulations are executed on parallel computing systems with large 
numbers of unknowns. 

There have been two projects with an application background from informatics: 
“Characterization of Carrier Sense Multiple Access in Vehicular Propagation 
Channels (CAR2X)” and “HMDB51 - A Large Video Database for Human Motion 
Recognition (HMDB)”. Both projects are computationally intense, but not primarily 
focused on parallel computing or HPC issues. 

CAR2X is a successor of previous related projects. Wireless communication 
between vehicles is considered to be important to increase the safety level of 
future smart transportation systems. While it sounds intuitively clear that a periodic 
exchange of status information may help to avoid dangerous traffic situations, it is 
not clear whether the envisioned communications system is sufficiently reliable and 
robust, and whether the employed Carrier Sense Multiple Access (CSMA) mechanism 
is able to coordinate concurrent access by multiple network nodes in a highly 
dynamic environment as intended. Therefore, the performance of a CSMA-based 
coordination mechanism is evaluated, based on a network simulation framework 
that emulates the signal processing steps of a transceiver and accurately models 
the multi-path propagation effects of the wireless vehicular radio channel. Due 
to these accuracy requirements, the execution of such high-fidelity simulations is 
computationally very expensive. 

The project HMDB addresses recognition and search in videos, a highly relevant 
topic against the background of millions of online videos viewed every day. While 
much attention has been paid to the collection and annotation of large static image 
datasets containing thousands of image categories, human action datasets lag far 
behind: Current action recognition databases contain on the order of ten different 
action categories. To address the important issues of collecting and benchmarking, 
the largest action video database to date with 51 action categories, which in total 
contain around 7,000 manually annotated clips extracted from a variety of sources 
ranging from digitized movies to YouTube, was collected. The goal is to provide a 
tool to evaluate the performance of computer vision systems for action recognition 
and to explore the robustness of these methods under various conditions such as 
camera motion, viewpoint, video quality, and occlusion. 


Software Framework ug4: Parallel Multigrid 
on the Hermit Supercomputer 


Ingo Heppner, Michael Lampe, Arne Nagel, Sebastian Reiter, Martin Rupp, 
Andreas Vogel, and Gabriel Wittum 


1 Introduction 

The modeling of physical phenomena in a variety of fields of scientific interest leads 
to a formulation in terms of partial differential equations. Especially when complex 
geometries as the domain of definition are involved, a direct and exact solution is 
not accessible, but numerical schemes are used to compute an approximate discrete 
solution. In this report, we focus on elliptic and parabolic types of equations that 
include spatial operators of second order. When discretizing such problems using 
commonly known discretization schemes such as finite element methods or finite 
volume methods, large systems of linear equations arise naturally. Their solution 
takes the largest amount of the overall computing time. 

A powerful algorithm for the solution of systems of linear equations is the 
multigrid method. A general introduction to this topic can be found, e.g., in [3]. 
It is known to have optimal complexity O(N), where N is the number of unknowns 
of the discrete system. In this report, we focus on the scalability of several 
multigrid approaches to show that the presented methods are well suited for 
use on parallel supercomputers. All methods are included in the ug4 simulation 
framework [9]. ug4 is designed for the solution of differential equations on 
unstructured hybrid grids in 1, 2 and 3 space dimensions on systems ranging from 
laptops to massively parallel computers. It is written in C++, striving for a flexible, 
yet fast and robust simulation environment. 

Fig. 1 Schematic overview of the multigrid setup 


All computations were performed on the Cray XE6 “Hermit” at HLRS Stuttgart. 
The XE6 is a 3,552 node cluster. Each node has two sockets, each holding a 16-core 
AMD Opteron 6276 (Interlagos) processor running at 2.3 GHz, which results in 113,664 cores in total. The 
nodes we were using have 32 GB RAM, resulting in 1 GB RAM per core when using 
the maximum of 32 cores/node. 
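The machine figures quoted above follow from simple arithmetic, reproduced here as a short check. The variable names are chosen for illustration only; the numbers are those stated in the text.

```python
# Figures as stated in the text for the Cray XE6 "Hermit".
nodes = 3552
sockets_per_node = 2
cores_per_socket = 16
ram_per_node_gb = 32

cores_per_node = sockets_per_node * cores_per_socket   # 32 cores per node
total_cores = nodes * cores_per_node                    # cores in the whole machine
ram_per_core_gb = ram_per_node_gb / cores_per_node      # memory share of one core

print(total_cores, ram_per_core_gb)
```

The 1 GB per core bound is what limits the number of unknowns that can be placed on each process when all 32 cores of a node are used.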

The remainder of this report is organized as follows: Sect. 2 briefly summarizes the ingredients of 
the multigrid methods. Section 3 is devoted to simple scalability benchmark tests. 
Section 4 presents an application of the methods to problems from density-driven 
flow. 


2 Methods 

2.1 Geometric Multigrid 

In this section, we present the idea of our implementation of a parallel geometric 
multigrid method. A detailed description is given in [7]. The parallelization has 
been implemented using MPI. 

The main aspect of the parallelization is the distribution of the multigrid 
hierarchy to the processes using a hierarchical approach. See Fig. 1 for an illustration 
of a possible assignment of a 1D problem to two processes. For the finest grid 
level of the multigrid, this can be thought of as a partition of the domain with 
respect to the processes. At the process borders within each grid level we build 
up horizontal interfaces that allow communication between nodes that 
represent the same global object but exist as a local copy on each process. Going 
down in the multigrid hierarchy, at a certain point the grid levels become very sparse 
and therefore we start to gather parts of the grids on some processes while the other 
processes have no grid part on the coarser levels. When traversing the multigrid in 
vertical direction during a multigrid cycle, this involves communication of data 
between the gathering processes and those that have no coarser grid level. To this 
end, we introduce vertical interfaces that allow communication in this direction. 

The multigrid cycle is performed as usual: starting on the finest grid, where 
all processes are involved, some smoothing is performed on each grid level using 
the horizontal interfaces. Going down in the multigrid hierarchy, the restriction is 
performed process-locally until a vertical cut is encountered. At this point, the vertical 
interfaces are used to transfer the data to the gathering process, and the sending 
processes become idle. Those processes that still have a multigrid part continue 
with the multigrid cycle. Once the bottom is reached, the procedure is reversed on the way 
up, again using vertical interfaces, to bring all processes back to work at the end. 
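The cycle structure described above can be sketched in serial form. The following is a minimal illustration only, not the ug4 implementation: it assumes a 1D Poisson problem with homogeneous Dirichlet boundaries, weighted Jacobi smoothing, full-weighting restriction and linear interpolation, and it omits all MPI communication. A comment marks the place where, in the parallel method, the vertical interfaces would come into play.

```python
def residual(u, f, h):
    """r = f - A u for the 1D operator A = tridiag(-1, 2, -1) / h^2."""
    n = len(u)
    r = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r[i] = f[i] - (2.0 * u[i] - left - right) / (h * h)
    return r

def smooth(u, f, h, sweeps=2, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps: u <- u + omega * D^-1 * r with D = 2 / h^2."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * 0.5 * h * h * ri for ui, ri in zip(u, r)]
    return u

def restrict(r):
    """Full weighting; coarse point j coincides with fine point 2j + 1."""
    nc = (len(r) - 1) // 2
    return [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
            for j in range(nc)]

def prolong(ec, n_fine):
    """Linear interpolation of a coarse correction to the fine grid."""
    e = [0.0] * n_fine
    for j, value in enumerate(ec):
        e[2 * j] += 0.5 * value
        e[2 * j + 1] += value
        e[2 * j + 2] += 0.5 * value
    return e

def v_cycle(u, f, h):
    n = len(u)
    if n == 1:                        # bottom of the hierarchy: solve exactly
        return [0.5 * h * h * f[0]]
    u = smooth(u, f, h)               # pre-smoothing on the current level
    rc = restrict(residual(u, f, h))
    # In the parallel method, a vertical cut at this level would gather rc
    # on fewer processes before the recursion continues.
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)
    u = [ui + ei for ui, ei in zip(u, prolong(ec, n))]
    return smooth(u, f, h)            # post-smoothing

n = 63                                # 2^6 - 1 interior points
h = 1.0 / (n + 1)
f = [1.0] * n
u = [0.0] * n
initial = max(abs(r) for r in residual(u, f, h))
for _ in range(10):
    u = v_cycle(u, f, h)
final = max(abs(r) for r in residual(u, f, h))
print(initial, final)
```

The number of grid points is chosen as a power of two minus one so that every coarse level is again of that form and the recursion ends on a single unknown.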

The success of this algorithm relies on the fact that on coarser grids the amount 
of computational work is very small compared to the smoothing effort on the 
finer grids. Therefore, the part of the algorithm where idle processes are present is 
negligible compared to the smoothing parts on finer grid levels where all processes 
are involved. When dealing with very large numbers of processes, i.e. thousands and 
more, we introduce several vertical cuts at certain points of the multigrid cycle. This 
leads to a tree structure of gathering during the multigrid cycle and is important to 
keep the costs for the one-to-many communication using vertical interfaces within 
reasonable bounds. 
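A rough estimate illustrates why the idle phases on coarse levels carry little weight. The sketch below assumes that the work per level is proportional to its number of unknowns and that each coarsening step in 3D reduces the unknowns by a factor of 8; both are simplifying assumptions, and the problem size is hypothetical.

```python
def level_sizes(finest_unknowns, coarsening_factor, num_levels):
    """Number of unknowns on each level, finest level first."""
    sizes = []
    n = finest_unknowns
    for _ in range(num_levels):
        sizes.append(n)
        n = max(1, n // coarsening_factor)
    return sizes

# Hypothetical hierarchy: 8^8 unknowns on the finest of 9 levels.
sizes = level_sizes(finest_unknowns=8 ** 8, coarsening_factor=8, num_levels=9)
total = sum(sizes)

finest_share = sizes[0] / total          # share of work on the finest level
coarse_share = sum(sizes[3:]) / total    # share on all levels below the third

print(f"finest level: {finest_share:.3f}, levels 4 to 9 together: {coarse_share:.4f}")
```

Under these assumptions the finest level alone accounts for about seven eighths of the work, so gathering the coarsest levels on a subset of processes affects only a small fraction of the total effort.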


2.2 Filtering Algebraic Multigrid 

Geometric multigrid methods, as introduced in the previous subsection, employ a 
hierarchy which is built up from coarse to fine. Thus, the distribution of the grids 
can be controlled by a grid manager from the start. In contrast, algebraic 
multigrid methods, see e.g. [8], build up the multigrid hierarchy from fine to 
coarse grids. Ideally, methods of this type only rely on matrices and graph structures 
provided by the user. This work focuses on the Filtering Algebraic Multigrid method 
(FAMG, [6, 11]). The key ideas can be summarized as follows: 

An essential ingredient of any multigrid method is an efficient interplay between 
smoother and coarse grid correction, i.e., those components of the error which are 
not reduced well by the smoother S on the fine grid must be treated by a coarse 
grid correction. We call those components algebraically smooth vectors. For this 
we need a restriction operator R (mapping defects from finer to coarser grids), an 
interpolation operator P (mapping corrections from coarser to finer grids), and a 
coarse grid operator $A_H$. 

In FAMG, we achieve this in the following way: Let t be a vector, which is known 
to be characteristic for the subspace in which the smoother does not converge well. 
Now the algorithm constructs the operators P and R so that 

• The vector t is contained in Range(P) 

• The two-grid correction $T = (I - P A_H^{-1} R A_h) S$ is efficient, i.e., $\min_P \|T\|$ 

For the sake of illustration, we restrict ourselves to the symmetric case, i.e., $A = A^T$, 
which leads to letting $R^T = P$. Note that there are extensions to FAMG for non-symmetric 
matrices, systems of equations and multiple test vectors. 


Global and Local Minimization Problems 

As shown in [11], it suffices to use the injection operator $R = R^{(\mathrm{inj})}$, so that we obtain 
the following minimization problem: 

$\min_P \| (I - P R^{(\mathrm{inj})}) S \|_{D^{-1}}$, 

s.t. $(I - P R^{(\mathrm{inj})}) S t = 0$ (1) 

Here D is the diagonal of $A_h$ and $\|\cdot\|_{D^{-1}}$ is the operator norm induced by the vector 
norm $\|x\|_{D^{-1}} = \sqrt{\langle D^{-1} x, x \rangle}$. 

In order to express this locally, we assume that a fine node i is interpolated by 
coarse nodes which are neighbors of i in the adjacency graph of the matrix $A_h$. 
Given a partitioning of all nodes into coarse (C) and fine (F) nodes and $q_i$ = i-th row of 
$(I - P R^{(\mathrm{inj})})$, we can rewrite problem (1) as a sum over local (i.e. independent) 
minimization problems: 

$\sum_{i \in F} \min_{q_i} |a_{ii}| \, \| S'^T q_i \|_{D^{-1}}$ 

s.t. $\langle q_i, S' t \rangle = 0 \quad \forall i \in F$ (2) 

This leads to Algorithm 1. 


Algorithm 1 GetPossibleParentNodes 
1: for all nodes i do 
2:    get $t_i$: (local) representation of the test vector 
3:    Calculate for all neighbor pairs $n \neq m \in N_i$: 
         $F_{i,nm} = \min_{q_{i,nm}} |a_{ii}| \, \| S'^T q_{i,nm} \|_{D^{-1}}$ 
         s.t. $\langle q_{i,nm}, S' t \rangle = 0$ (3) 
4:    if $\delta F_{i,nm} < \min_{a \neq b \in N_i} F_{i,ab}$ and $F_{i,nm} < \delta$, $\delta < 1$ then 
5:       save the pair (n, m) in $PN^i$ and their quality value $F_{i,nm}$. 
6:    end if 
7: end for 


After that, a set of possible parent nodes $PN^i$ is assigned to every node i. 


Used Smoother S' 

In the construction of the interpolation operator, we only need to focus on 
interpolating smooth vectors. For this, we use a special smoothing operator S'. S' 
consists of one Jacobi step $S_{jac} = I - \omega D^{-1} A$, followed by a Jacobi step updating 
only fine nodes (F-smoothing): 

$S_{jac,F} = I - \sum_{i \in F} e_i e_i^T D^{-1} A$ (4) 

so we obtain 

$S' = S_{jac,F} \, S_{jac}$ (5) 

Coarsening Algorithm 

The FAMG coarsening algorithm chooses nodes which are interpolated well, and 
sets their parent nodes coarse. Since we want to keep the number of coarse 
nodes small, we prefer those nodes which create the fewest new coarse nodes. 
Additionally, we want to take the best interpolating pair. The rating for each node is 
calculated by 

$R_i = \min_{M \in PN^i \setminus F} |M| - |M \cap C| + F_{i,mn}/\delta$ (6) 

This leads to Algorithm 2. Note that $F_{i,mn}/\delta < 1$, so that reducing the number of 
coarse nodes has higher priority. Since a node can only be either coarse or fine, it is 
important that we have calculated more than one possible parent pair for each node 
in Algorithm 1, so that we have greater flexibility in the coarsening process. 


Algorithm 2 FAMGCoarsening
1: calculate ratings $R_i$ for all nodes
2: while interpolatable nodes are left do
3:   get the interpolatable node $i$ with the lowest rating which is neither coarse nor fine
4:   get the best available parent nodes $n, m$ for $i$
5:   set $i$ fine
6:   set parent nodes $n, m$ coarse
7:   for all neighbors $j \in N_i$ do
8:     since $i$ cannot be coarse anymore:
9:     remove all parent pairs in $PN^j$ which contain $i$
10:    update rating $R_j$
11:  end for
12: end while
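A serial sketch of Algorithm 2 may help to see how the rating (6) drives the selection. It is a simplification under assumptions: the input format (a dictionary mapping each node to candidate parent pairs with quality values, as Algorithm 1 would produce) and the function name are hypothetical, and ratings are recomputed from scratch in every pass instead of being updated only for the neighbors of $i$.

```python
def famg_coarsening(parent_pairs, delta=0.5):
    """Greedy coarse/fine splitting in the spirit of Algorithm 2.

    parent_pairs : dict node -> list of ((n, m), F) candidates, with F < delta
    Returns (coarse, fine) as two sets of nodes.
    """
    coarse, fine = set(), set()
    # Work on copies so the caller's candidate lists stay untouched
    pairs = {i: list(cands) for i, cands in parent_pairs.items()}

    def pair_cost(cand):
        (n, m), quality = cand
        members = {n, m}
        # Rating term of eq. (6): new coarse nodes needed, quality as tie-breaker
        return len(members) - len(members & coarse) + quality / delta

    while True:
        # Interpolatable nodes: still undecided and with at least one valid pair
        candidates = [i for i in pairs
                      if i not in coarse and i not in fine and pairs[i]]
        if not candidates:
            break
        # Node with the lowest rating (node id breaks ties deterministically)
        i = min(candidates, key=lambda k: (min(pair_cost(c) for c in pairs[k]), k))
        (n, m), _ = min(pairs[i], key=pair_cost)
        fine.add(i)
        coarse.update((n, m))
        # i is fine now, so it can no longer serve as a parent node
        for j in pairs:
            pairs[j] = [c for c in pairs[j] if i not in c[0]]
    return coarse, fine

# Chain 0-1-2-3-4: the inner nodes each have one candidate parent pair
chain = {0: [], 1: [((0, 2), 0.1)], 2: [((1, 3), 0.2)], 3: [((2, 4), 0.1)], 4: []}
coarse_nodes, fine_nodes = famg_coarsening(chain)
```

In this chain example, node 1 is selected first, which makes nodes 0 and 2 coarse; node 3 then needs only one new coarse node and is selected next.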


Parallelization 

For (3) we need to know the matrix $A$ in the neighbors of $i$ ($N^1$) and the neighbors
of the neighbors of $i$ ($N^2$), i.e., in parallel, we require an overlap of 2 of the matrix
$A$. To ensure a consistent coarsening, we also need to parallelize the coarsening
process. In FAMG, we use a graph coloring algorithm for this purpose, so that no
two cores which could set the same node coarse or fine are coarsening at the same
time (Algorithm 3).

Algorithm 3 FAMGParallelCoarsening
1: Calculate a graph coloring so that no two cores which could set the same node coarse or fine
   have the same color
2: Receive coarsening data from cores with lower color
3: Call FAMGCoarsening (Algorithm 2)
4: Send coarsening data to cores with higher color


Putting all this together, we obtain Algorithm 4:


Algorithm 4 ParallelFAMG
1: Calculate overlap 2 of $A$
2: Call GetPossibleParentNodes (Algorithm 1)
3: Call FAMGParallelCoarsening (Algorithm 3)
4: Send and receive prolongation on border nodes
5: Calculate $R = P^T$, $A_H = R A_h P$
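The last step of Algorithm 4 is the usual Galerkin product. A minimal dense sketch follows; the function name and the hand-written linear interpolation matrix are assumptions made for illustration only, and a production code would use sparse, distributed matrices.

```python
import numpy as np

def galerkin_coarse_operator(A_h, P):
    """Return (R, A_H) with R = P^T and A_H = R A_h P."""
    R = P.T
    return R, R @ A_h @ P

# 1d Laplace matrix on 5 nodes; coarse nodes are 0, 2 and 4
A_h = 2 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
# Linear interpolation: fine nodes 1 and 3 average their two coarse neighbors
P = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R, A_H = galerkin_coarse_operator(A_h, P)
```

Choosing $R = P^T$ keeps $A_H$ symmetric whenever $A_h$ is symmetric, which matters when the multigrid cycle is used as a preconditioner for CG.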


Agglomeration 

Parallel algorithms work best if the time spent on computation and the time spent
on communication are well balanced. Since the coarsening Algorithms 2 and 3
decrease the problem size, we eventually reach the point where the problem size
on one core is so small that communication becomes the bottleneck. Let $N_{min}$ be the
minimum number of unknowns on one core. If a core has fewer unknowns than $N_{min}$
on the current level, all participating processors send the following information to
one processor:

• The number of unknowns

• The cores which they are connected to via interfaces, and the size of the interfaces

Now an agglomeration algorithm is run, which calculates a heuristic graph
partitioning so that

• All participating cores have more than $N_{desired}$ unknowns

• The number of participating cores is maximized

• The maximum size of the interfaces between two cores is minimized

After that, some cores will be agglomerated and become idle. Note that the general
idea is similar to the algorithm used in geometric multigrid: While geometric
multigrid uses vertical cuts, FAMG employs a vertical agglomeration strategy.


2.3 FETI-DP 

The FETI-DP method (“Dual-Primal Finite Element Tearing and Interconnecting”)
is a domain decomposition method that belongs to the class of non-overlapping
domain decomposition methods [1]. In this method, global continuity of the solution
at subdomain interfaces is initially only enforced for the subdomain corners (a.k.a.
“subdomain vertices”, “cross points”), the so-called primal variables. For the
remaining unknowns on the subdomain edges, the so-called dual variables,
continuity is enforced by Lagrange multipliers, which enter as additional unknowns in a CG
iteration. In each step of the CG iteration, systems of equations must be solved on
the subdomains, which can be done independently in parallel. Additionally, a Schur
complement equation has to be assembled at the cross points and solved for the
primal variables. It serves as a “coarse problem” that ensures the global exchange
of information between all subdomains that is necessary for good parallel scalability.

Fig. 2 Speedup on Hermit for 2d Laplace problem, solved with GMG

While a standard FETI-DP implementation usually uses only one PE per subdomain
to solve the subproblems defined here, our implementation also allows
multiple processes per subdomain. We use MPI communicators to describe the
group of PEs working in parallel on a FETI subdomain.


3 Numerical Results 
3.1 Geometric Multigrid 

In order to test the implementation, we show results for the Laplace equation on
the unit cube

$$-\Delta u = f \quad \text{on } [0,1]^d, \qquad (7)$$

where d = 2,3 denotes the physical world dimension. The equation is discretized 
using the vertex-centered finite volume scheme on a regular grid consisting of 
squares and cubes respectively. Although this model problem is simple, it provides 
a good example for a variety of problems that arise in more involved real-world 
situations.

Table 1 Laplace 2d on Hermit, solved with GMG, 7-14 refinements

PE       level  DoF             T_t (s)  T_a (s)  T_s (s)  T_a+s (s)  E_a+s (%)  S_a+s     S_ideal
4        6      263,169         3.345    0.469    0.684    1.153      100.0      1.0       1
16       7      1,050,625       4.249    0.506    1.049    1.555      74.2       3.0       4
64       8      4,198,401       4.376    0.529    1.075    1.604      71.9       11.5      16
256      9      16,785,409      4.592    0.529    1.100    1.629      70.8       45.3      64
1,024    10     67,125,249      4.991    0.518    1.206    1.724      66.9       171.3     256
4,096    11     268,468,225     5.108    0.521    1.140    1.662      69.4       710.8     1,024
16,384   12     1,073,807,361   7.985    0.530    1.248    1.779      64.9       2656.6    4,096
65,536   13     4,295,098,369   18.856   0.528    1.341    1.869      61.7       10112.0   16,384



Fig. 3 Times on Hermit for 2d Laplace problem, solved with GMG 


Table 1 shows the results of the weak scalability test for the 2d problem with up to
65,536 processes (“processing entities”, or “PE” for short) on Hermit. The assembly
of the matrix is trivially parallelizable, since each element adds its contributions to
the process-local matrix, and the measured assembly time is, as expected, nearly
constant (0.47 to 0.53 s). As displayed in Fig. 3, the times for the multigrid solver
stay roughly constant over a long range of processes.
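The efficiency and speedup columns of the weak scaling tables can be reproduced from the combined assembly and solve time. The sketch below assumes the definitions $E(P) = T(P_0)/T(P)$ and $S(P) = E(P) \cdot P/P_0$ relative to the first row; these definitions are inferred here because they match the tabulated values up to rounding, and the function name is hypothetical.

```python
def weak_scaling(pe, times):
    """Weak-scaling efficiency and speedup relative to the first entry.

    pe    : process counts, smallest run first
    times : measured times of the same phase (e.g. assembly plus solve)
    """
    p0, t0 = pe[0], times[0]
    # Ideal weak scaling keeps the time constant, so E = T(P0) / T(P)
    efficiency = [t0 / t for t in times]
    # Speedup scales the efficiency with the growth in process count
    speedup = [e * p / p0 for e, p in zip(efficiency, pe)]
    return efficiency, speedup

# First three rows of Table 1 (T_a+s in seconds)
pe = [4, 16, 64]
t_as = [1.153, 1.555, 1.604]
eff, spd = weak_scaling(pe, t_as)
```

Small deviations in the last digit compared with the table are to be expected, because the table was presumably computed from unrounded timings.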

Table 2 shows the corresponding results for the 3d problem up to 4,096 processes on
Hermit. Results are similar to the 2d case and are shown in Figs. 4 and 5.


3.2 Algebraic Multigrid 

Table 3 shows the results of FAMG solving a 2d-Laplace equation (7) on Hermit. 
The equation was discretized with a finite element method on triangles resulting in 
a 5-point-operator and solved with FAMG as preconditioner for the CG method.

Table 2 Laplace 3d on Hermit, solved with GMG

PE       level  DoF             T_t (s)  T_a (s)  T_s (s)  T_a+s (s)  E_a+s (%)  S_a+s     S_ideal
1        4      35,937          3.082    0.585    0.438    1.023      100.0      1.0       1
8        5      274,625         4.619    0.766    1.136    1.903      53.8       4.3       8
64       6      2,146,689       6.087    0.811    1.204    2.015      50.8       32.5      64
512      7      16,974,593      5.703    0.820    1.268    2.088      49.0       250.7     512
4,096    8      135,005,697     6.279    0.807    1.285    2.091      48.9       2003.2    4,096
32,768   9      1,076,890,625   16.407   0.813    1.490    2.303      44.4       14554.1   32,768

PE number of processing entities (cores), level refinement level, T_t total time spent, T_a time spent
for assembly, T_s time spent for solving, T_a+s, E_a+s, S_a+s time spent, efficiency and speedup for
assembly and solve, S_ideal ideal speedup



Fig. 4 Speedup on Hermit for 3d Laplace problem, solved with GMG 



Fig. 5 Times on Hermit for 3d Laplace problem, solved with GMG

Table 3 Scaling of Laplace problem in 2d on Hermit solved with FAMG

PE      level  DoF           T_total (s)  T_setup (s)  AMG levels  T_solve (s)  N_iter  T_solve/N_iter
4       8      263,169       6.87         3.47         9           1.20         9       0.13
16      9      1,050,625     9.92         4.17         11          2.26         11      0.21
64      10     4,198,401     12.11        4.92         13          2.40         11      0.22
256     11     16,785,409    14.02        5.92         15          2.78         12      0.23
1,024   12     67,125,249    16.60        6.95         18          3.37         14      0.24
4,096   13     268,468,225   20.39        8.53         20          3.90         15      0.26

T_total total time for program, T_setup time for FAMG setup, T_solve time for CG solver using FAMG
multigrid as preconditioner, N_iter number of CG iterations


Table 4 Number of participating cores for parallel FAMG

            Used cores
AMG level   PEs: 4   PEs: 16   PEs: 64   PEs: 256   PEs: 1,024   PEs: 4,096
0-7         4        16        64        256        1,024        4,096
8           2        8         30        120        479          1,883
9           1        4         15        62         249          1,044
10          -        2         8         31         124          533
11          -        1         4         16         72           291
12          -        -         2         8          44           188
13          -        -         1         4          24           110
14          -        -         -         3          13           61
15          -        -         -         1          7            37
16          -        -         -         -          3            17
17          -        -         -         -          2            11
18          -        -         -         -          1            5
19          -        -         -         -          -            2
20          -        -         -         -          -            1


Table 4 lists the number of participating cores on each level for $N_{min} = N_{desired} =
1{,}000$, generated by the agglomeration strategy from Sect. 2.2.

First results of FAMG on Hermit are encouraging: over a large range of core
numbers, the total running time grows, but stays within an acceptable range.
For large numbers of processes, both the setup time and the solution time increase.
Possible reasons are that the coarsening rate of the standard FAMG method is only 50 %,
which results in a larger number of levels than in geometric multigrid, and that the number
of iterations does not remain constant. The time per iteration, however, is bounded,
and even remains constant if the increase in operator complexity is considered as
well.

For future work, we hope to address the previously mentioned shortcomings by 
improving coarsening strategies and graph coloring algorithms. We expect this to 
improve coarsening rates and efficiency as well.

Table 5 Laplace 2d on Hermit, solved with FETI-DP, and FAMG as subproblem solvers, 1 PE
per FETI subdomain - weak scalability relative to 4 PE

PE      level  DoF          T_t (s)  T_a (s)  T_s (s)  T_a+s (s)  E_a+s (%)  S_a+s  S_ideal
4       5      66,049       7.484    0.121    1.883    2.004      100.0      1.0    1
16      6      263,169      13.717   0.135    6.885    7.019      28.5       1.1    4
64      7      1,050,625    18.008   0.139    10.419   10.558     19.0       3.0    16
256     8      4,198,401    18.812   0.138    10.924   11.062     18.1       11.6   64
1,024   9      16,785,409   19.765   0.140    10.920   11.060     18.1       46.4   256


Table 6 Laplace 2d on Hermit, solved with FETI-DP, and FAMG as subproblem solvers, 4 PE
per FETI subdomain - weak scalability relative to 16 PE

PE      level  DoF          T_t (s)  T_a (s)  T_s (s)  T_a+s (s)  E_a+s (%)  S_a+s  S_ideal
16      6      263,169      10.652   0.133    3.554    3.688      100.0      1.0    1
64      7      1,050,625    18.346   0.137    10.004   10.141     36.4       1.5    4
256     8      4,198,401    23.019   0.140    14.313   14.453     25.5       4.1    16
1,024   9      16,785,409   24.598   0.142    15.230   15.372     24.0       15.4   64
4,096   10     67,125,249   26.904   0.142    15.980   16.123     22.9       58.6   256


Table 7 Laplace 2d on Hermit, solved with FETI-DP, and FAMG as subproblem solvers, 16 PE
per FETI subdomain - weak scalability relative to 64 PE

PE       level  DoF           T_t (s)  T_a (s)  T_s (s)  T_a+s (s)  E_a+s (%)  S_a+s  S_ideal
64       7      1,050,625     13.651   0.141    5.333    5.474      100.0      1.0    1
256      8      4,198,401     24.151   0.140    14.379   14.519     37.7       1.5    4
1,024    9      16,785,409    30.729   0.143    20.433   20.576     26.6       4.3    16
4,096    10     67,125,249    33.661   0.145    21.981   22.126     24.7       15.8   64
16,384   11     268,468,225   42.200   0.144    22.268   22.412     24.4       62.5   256


3.3 FETI-DP 

To investigate and to demonstrate the potential of the FETI-DP implementation
in ug4, we solved the model problem (7) from Sect. 3.1 using FAMG (Sect. 2.2) as
subproblem solver. This problem (with only one process per FETI subdomain, i.e.,
“standard” FETI-DP) has previously been used as a benchmark, e.g., in [4].

Tables 5-7 show the results of the weak scalability tests performed on Hermit
for the 2d problem with up to 16 Ki (16,384) PE, for three series with one, four and
16 PE per FETI subdomain. Please note that, since the number of available MPI
communicators is limited, every series only extends to 1 Ki FETI subdomains.

As for the other methods described in this paper, the assembly times scale very
well. In contrast, the solve times show a distinct jump at the beginning of each series.
However, when the speedups are computed relative to a job of a series with a larger
number of PEs, e.g. 64 PE, our FETI-FAMG method scales quite well, as can
be seen from Fig. 6, where the curves become parallel to the line denoting the ideal
speedup. This behavior has yet to be investigated.


Fig. 6 Speedup on Hermit for the 2d Laplace problem, solved with FETI-DP, 16 PE per FETI 
subdomain 


4 Density Driven Flow 

The transport of dissolved salt in flowing groundwater in porous media can
be described by two nonlinear, coupled, time-dependent differential equations,
describing the balance of the fluid phase as a whole and the balance of the mass
of brine (cf. [5]):

$$\partial_t (\phi \rho) + \nabla \cdot (\rho q) = 0,$$
$$\partial_t (\phi \rho \omega) + \nabla \cdot (\rho \omega q - \rho D \nabla \omega) = 0,$$
$$q = -\frac{K}{\mu} (\nabla p - \rho g) \qquad (8)$$

where $\omega$ and $p$ are the unknown brine mass fraction and pressure, $\phi$ is the porosity, $K$ the
permeability, $\mu$ the viscosity, $\rho$ the density, $g$ the gravity field, and $D$ the
diffusion-dispersion tensor.

One well-known benchmark is the so-called Elder problem [10]. In its 2d
formulation it is defined on a domain $\Omega := [0, 600\,\text{m}] \times [0, 150\,\text{m}]$. Initially the
brine mass fraction is set to zero in the whole domain. At the middle of the top boundary,
a Dirichlet boundary condition of 1 is set for the brine mass fraction, while it is
fixed to zero at the bottom. All other boundary parts are no-flow boundaries. Due
to density differences and gravity, the evolution of the system in time leads
to a downward flow of brine. See Fig. 7 for a computed solution. For 3d we use a
corresponding setting on the domain $\Omega := [0, 600\,\text{m}] \times [0, 600\,\text{m}] \times [0, 150\,\text{m}]$. Results
are shown in Fig. 8.

This system of equations is discretized using a vertex-centered finite
volume scheme [2]. As time stepping scheme, the implicit Euler method is used.

Fig. 7 Computed solution of the Elder problem in 2d at an early (left) and a more evolved (right)
time point

Fig. 8 Computed solution of the Elder problem in 3d at an early (left) and a more evolved (right)
time point


Table 8 Scaling of Elder problem in 2d on Hermit

PE       level  DoF             T_t (s)    T_N (s)    N_N  T_N/N_N (s)  T_gmg (s)  N_gmg  T_gmg/N_gmg (s)
64       9      8,398,850       407.251    396.775    20   19.839       195.142    352    0.554
256      10     33,574,914      577.662    565.889    23   24.604       349.564    636    0.550
1,024    11     134,258,690     1424.820   1412.930   39   36.229       1099.180   1,956  0.562
4,096    12     536,952,834     1138.300   1088.380   40   27.210       793.910    1,413  0.562
16,384   13     2,147,647,490   1126.930   1036.010   40   25.900       775.619    1,380  0.562

T_N time for whole Newton method including assembly of the Jacobian, N_N total number of Newton steps,
T_gmg overall time spent in multigrid cycles, N_gmg number of multigrid cycles


The resulting discrete, fully coupled non-linear equations are solved by a Newton
method. Within each Newton step, the Jacobian matrix of the problem is computed
exactly. For this linearized matrix system, the linear geometric multigrid solver is
applied. As smoother, a block-Jacobi smoother with inner ILU decomposition is
used, where the Jacobi blocks correspond to the parts of the matrix that are owned by
each parallel process.
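The combination of implicit Euler time stepping with a Newton iteration that uses the exact Jacobian can be illustrated on a small model system. The sketch below is not the ug4 solver: the function names are hypothetical, the test equation $u' = -u^3$ is chosen only for illustration, and a dense direct solve stands in for the geometric multigrid solver.

```python
import numpy as np

def implicit_euler_step(u_old, dt, f, jac, tol=1e-12, max_newton=20):
    """One implicit Euler step for u' = f(u), solved with Newton's method.

    The nonlinear system is g(u) = u - u_old - dt * f(u) = 0 and its exact
    Jacobian is I - dt * f'(u).
    """
    u = u_old.copy()
    for _ in range(max_newton):
        residual = u - u_old - dt * f(u)
        if np.linalg.norm(residual) < tol:
            break
        J = np.eye(len(u)) - dt * jac(u)
        # Linearized system; a multigrid solver would be used here in practice
        u = u - np.linalg.solve(J, residual)
    return u

# Illustrative nonlinear decay problem u' = -u^3 with two components
f = lambda u: -u**3
jac = lambda u: np.diag(-3.0 * u**2)
u0 = np.array([1.0, 2.0])
dt = 0.1
u1 = implicit_euler_step(u0, dt, f, jac)
```

As in the Elder problem runs, the number of Newton steps per time step is not fixed in advance but depends on how quickly the residual drops below the tolerance.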


Results 

Table 8 shows the results of the weak scalability test for the 2d problem. The
time for each multigrid cycle is essentially constant (0.55 to 0.56 s). Since this problem is
non-linear, the number of Newton steps does not remain constant, but stays in a
reasonable range. Table 9 displays the corresponding results for the 3d problem.

Table 9 Scaling of Elder problem in 3d on Hermit

PE       level  DoF           T_t (s)   T_N (s)   N_N  T_N/N_N (s)  T_gmg (s)  N_gmg  T_gmg/N_gmg (s)
64       4      1,098,306     240.184   196.331   20   9.817        86.033     457    0.188
512      5      8,586,370     480.574   436.349   20   21.817       132.401    679    0.195
4,096    6      67,897,602    551.408   500.743   20   25.037       233.181    1,170  0.199
32,768   7      540,021,250   784.101   602.627   39   15.452       307.558    1,551  0.198

T_N time for whole Newton method including assembly of the Jacobian, N_N total number of Newton steps,
T_gmg overall time spent in multigrid cycles, N_gmg number of multigrid cycles


5 Conclusion and Outlook 

This report presented some scalability studies for solvers in the ug4 library.
Although the results are still preliminary, we showed that multigrid methods are
scalable and efficient tools for solving partial differential equations on parallel
supercomputers. On Hermit, the tests show scalability for up to 65,536 cores.
Future work will be dedicated to the consolidation and enhancement of ug4 and
its application to more real-world problems.

Simulation of Liquid-Liquid Equilibria with
Molecular Models Optimized to Vapor-Liquid
Equilibria and Model Development for
Hydrazine and Two of Its Derivatives


Stefan Eckelsbach, Thorsten Windmann, Ekaterina Elts, and Jadran Vrabec 


1 Introduction 

In the chemical industry, knowledge on fluid phase equilibria is crucial for design 
and optimization of many technical processes. In a chemical plant, the costs for 
separation facilities constitute one of the highest investment outlays, typically in the 
order of 40-80 % [15]. Not only vapor-liquid equilibrium (VLE) data are of interest, 
e.g. for distillation columns, but also other types of phase equilibria. For example, 
liquid-liquid equilibrium (LLE) data provide the basis for extraction processes. 

Classically, thermodynamic data for the design of such processes have to be 
measured experimentally and have to be aggregated by empirical correlations. For 
practical applications this leads to problems. For example, it is not possible to 
describe the entire fluid phase behavior consistently with a single model and set 
of parameters. Thus, LLE data cannot be predicted reliably from VLE data (or
vice versa) based on such correlations. Furthermore, the effort for measurements 
in the laboratory is very high, because every single fluid system of interest has to 
be measured individually. This approach particularly reaches its limits when multi- 
component fluids or systems with multiple phases are of interest due to the sheer 
amount of independent variables. In a recent study by Hendriks et al. [10] about 
the demand of thermodynamic and transport properties in the chemical industry, the 
urgent need for a reliable and predictive approach to describe VLE as well as LLE 
with a single model and parameter set is pointed out. 

In this work, the capability of molecular simulation to determine fluid phase
equilibria is examined, especially with respect to the problems of the classical
approach described above. Molecular modeling and simulation is a modern route
for the prediction of thermodynamic properties. Being based on mathematical
representations of the intermolecular interactions, it has strong predictive capabilities as
it adequately covers structure, energetics and dynamics on the microscopic scale that
govern the fluid behavior on the macroscopic scale. In preceding work, molecular
models (force fields) were developed to accurately describe pure substance VLE
data [21] and were successfully assessed with respect to VLE data of binary and
ternary mixtures [18]. In the present work, the capability of those models to predict
LLE data is studied for the binary mixture Nitrogen + Ethane. The decomposition
of a randomly dispersed mixture of these two components into their LLE phases
is investigated. Subsequently, the composition of the phases in equilibrium is
compared to experimental data. It is shown that VLE and LLE can be predicted
by molecular simulation in good agreement with experimental results.

Furthermore, three new models for Hydrazine, Monomethylhydrazine and
Dimethylhydrazine were developed. They are based on quantum chemical information
and were optimized to experimental data on saturated liquid density and vapor
pressure. Thereafter, the models were assessed by comparing simulated VLE to
experimental reference data, which they are able to reproduce.


2 Phase Equilibria 

2.1 Vapor-Liquid Equilibria 

The molecular model for the mixture Nitrogen + Ethane was developed in
preceding work of our group [20]. Pure substance models can be applied to mixtures in a
straightforward manner by assigning the unlike interaction parameters of the two components A
and B. They are defined by the modified Lorentz-Berthelot combination rules. It was
already shown in [20] that one additional parameter $\xi$ in the equation for calculating
the unlike energy parameter $\varepsilon_{AB}$, which allows the simulation results to be adjusted to the
VLE data of the mixture, leads to an improvement of the predictive quality of the
model.

$$\sigma_{AB} = \frac{\sigma_A + \sigma_B}{2} \qquad (1)$$

$$\varepsilon_{AB} = \xi \sqrt{\varepsilon_A \varepsilon_B} \qquad (2)$$
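The two combination rules translate directly into code. In the sketch below the function name is hypothetical and the numerical LJ parameters are placeholders chosen for easy arithmetic, not the parameters of the Nitrogen or Ethane models from [20]; only the value of the binary parameter, 0.974, is taken from the text.

```python
import math

def modified_lorentz_berthelot(sigma_a, sigma_b, eps_a, eps_b, xi=1.0):
    """Unlike LJ parameters from eqs. (1) and (2).

    sigma_ab is the arithmetic mean of the size parameters; eps_ab is the
    geometric mean of the energy parameters scaled by the binary parameter xi.
    """
    sigma_ab = 0.5 * (sigma_a + sigma_b)
    eps_ab = xi * math.sqrt(eps_a * eps_b)
    return sigma_ab, eps_ab

# Placeholder pure-component values (illustrative only)
sigma_ab, eps_ab = modified_lorentz_berthelot(3.0, 4.0, 100.0, 400.0, xi=0.974)
```

With $\xi = 1$ the classical Lorentz-Berthelot rules are recovered, so $\xi$ is the only state-independent quantity that has to be adjusted to mixture data.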


The binary parameter of the mixture Nitrogen + Ethane was defined as $\xi = 0.974$
[20]. Figure 1 shows the VLE phase diagram at 200 and 290 K as determined on the
basis of this mixture model in comparison to experimental data. The Peng-Robinson
equation of state is given as an example of a classical correlation approach. The
simulations coincide closely with the experimental reference values for pressures
below 7 MPa. With rising pressure, the deviations also increase, but the values
are closer to the experiment than the Peng-Robinson equation of state, which
overestimates the critical region and shows significant deviations particularly on
the saturated liquid line.

Fig. 1 Vapor-liquid equilibria of the mixture Nitrogen and Ethane: simulation results (•);
experimental data (+) [4]; Peng-Robinson equation of state (-). The statistical
uncertainties of the present data are within symbol size

Fig. 2 Progress of the mole fraction of Nitrogen in phase 1 (▲) and phase 2 (■) over
the duration of the simulation t: The simulation was performed for a temperature
of 128 K and a pressure of 11.03 MPa


2.2 Liquid-Liquid Equilibria 

Molecular dynamics simulations were performed for a mixture of 20,000 molecules,
consisting of 40 mol-% Ethane and 60 mol-% Nitrogen. The two components were
randomly dispersed in the initial configuration. The simulations were carried out
with our molecular dynamics code ls1 mardyn. They were performed in the canonical
ensemble and the length of one time step was set to 2 fs.

Starting from a randomly dispersed mixture, it was found that the system
decomposes spontaneously into two coexisting liquid phases. As an example, the progress
of a simulation at the pressure 11.03 MPa and the temperature 128 K is plotted in
Fig. 2. The simulation required about $2.4 \cdot 10^7$ time steps to lead to an equilibrated
state, which represents a typical duration of the present simulations of around
50 ns. Thereafter, the two phases can be clearly identified by the mole fraction over
the length of the simulation volume, which is presented in Fig. 3. This makes it
possible to predict the LLE phase behavior for different thermodynamic
conditions. The comparison of the present results with experimental data is shown
in Fig. 4. The simulated mole fractions agree well with the experimental data, so
they also reproduce the pressure dependence of the LLE.

Fig. 3 Mole fraction of Nitrogen over the length of the simulation volume in z
direction: the simulation was carried out at a temperature of 128 K and a pressure of
11.03 MPa

Fig. 4 Liquid-liquid equilibria of the mixture Nitrogen + Ethane:
simulation results (•) for a temperature of approximately 128 K; experimental data (+)
for a temperature of 126.7-129 K [4]
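The quoted simulation length follows from the number of time steps and the step length given above. The short calculation below only restates this arithmetic; the variable names are chosen here for illustration.

```python
# Simulated physical time from the step count and step length quoted in the text
n_steps = 2.4e7          # time steps until the equilibrated state was reached
dt_fs = 2.0              # length of one time step in femtoseconds
FS_PER_NS = 1.0e6        # 1 ns = 10^6 fs

duration_ns = n_steps * dt_fs / FS_PER_NS
print(f"simulated time: {duration_ns:.0f} ns")
```

The result of 48 ns is consistent with the statement that a typical run covers around 50 ns.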


3 Hydrazines 

3.1 Molecular Model Class 

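The present molecular models include three groups of parameters. These are (1)
the geometric parameters, specifying the positions of different interaction sites,
(2) the electrostatic parameters, defining the polar interactions in terms of point
charges, dipoles or quadrupoles, and (3) the dispersive and repulsive parameters,
determining the attraction by London forces and the repulsion by electronic orbital
overlaps. Here, the Lennard-Jones (LJ) 12-6 potential [11, 12] was used to describe
the dispersive and repulsive interactions. The total intermolecular interaction energy
thus writes as

$$U = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \Bigg\{ \sum_{a=1}^{S_i^{LJ}} \sum_{b=1}^{S_j^{LJ}} 4 \varepsilon_{ijab} \left[ \left( \frac{\sigma_{ijab}}{r_{ijab}} \right)^{12} - \left( \frac{\sigma_{ijab}}{r_{ijab}} \right)^{6} \right] + \sum_{c=1}^{S_i^{e}} \sum_{d=1}^{S_j^{e}} \frac{1}{4 \pi \epsilon_0} \Bigg[ \frac{q_{ic} q_{jd}}{r_{ijcd}} + \frac{q_{ic} \mu_{jd} + \mu_{ic} q_{jd}}{r_{ijcd}^2} f_1(\omega_i, \omega_j) + \frac{q_{ic} Q_{jd} + Q_{ic} q_{jd}}{r_{ijcd}^3} f_2(\omega_i, \omega_j) + \frac{\mu_{ic} \mu_{jd}}{r_{ijcd}^3} f_3(\omega_i, \omega_j) + \frac{\mu_{ic} Q_{jd} + Q_{ic} \mu_{jd}}{r_{ijcd}^4} f_4(\omega_i, \omega_j) + \frac{Q_{ic} Q_{jd}}{r_{ijcd}^5} f_5(\omega_i, \omega_j) \Bigg] \Bigg\}$$

where $r_{ijab}$, $\varepsilon_{ijab}$, $\sigma_{ijab}$ are the distance, the LJ energy parameter and the LJ size
parameter, respectively, for the pair-wise interaction between LJ site $a$ on molecule
$i$ and LJ site $b$ on molecule $j$. The permittivity of the vacuum is $\epsilon_0$, whereas $q_{ic}$, $\mu_{ic}$
and $Q_{ic}$ denote the point charge magnitude, the dipole moment and the quadrupole
moment of the electrostatic interaction site $c$ on molecule $i$ and so forth. The
expressions $f_x(\omega_i, \omega_j)$ stand for the dependence of the electrostatic interactions
on the orientations $\omega_i$ and $\omega_j$ of the molecules $i$ and $j$, cf. [1, 7]. Finally, the
summation limits $N$, $S_x^{LJ}$ and $S_x^{e}$ denote the number of molecules, the number of
LJ sites and the number of electrostatic sites, respectively.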
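The dispersive and repulsive contribution of a single site pair is the standard LJ 12-6 term. The sketch below evaluates only this term; the function name and the reduced-unit example values are chosen here for illustration, and the electrostatic contributions with their orientation dependence are not covered.

```python
def lj_12_6(r, epsilon, sigma):
    """Lennard-Jones 12-6 energy of one site pair at distance r.

    epsilon is the energy parameter (well depth) and sigma the size
    parameter (distance at which the energy crosses zero).
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

# Reduced units: epsilon = sigma = 1
u_at_sigma = lj_12_6(1.0, 1.0, 1.0)            # zero crossing
u_at_minimum = lj_12_6(2.0 ** (1.0 / 6.0), 1.0, 1.0)  # potential minimum
```

The energy vanishes at $r = \sigma$ and reaches its minimum of $-\varepsilon$ at $r = 2^{1/6} \sigma$, which is why $\sigma$ and $\varepsilon$ are referred to as the size and energy parameters.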


3.2 Molecular Pure Substance Models 

For all molecular models developed in the present work, the internal degrees of
freedom were neglected and the models were chosen to be rigid. As a first step,
the geometric data of the molecules, i.e. bond lengths, angles and dihedrals, were
determined by quantum chemical (QC) calculations. To this end, a geometry optimization
was performed via an energy minimization using the GAMESS (US) package [17]. The
Hartree-Fock level of theory was applied with a relatively small (6-31G) basis set. All LJ
parameters and the charge magnitudes were initially taken from prior models and
fine-tuned during the model parameter optimization to vapor pressure and saturated
liquid density.

Figure 5 shows the developed molecular models. 

The results for saturated densities obtained with the present model are compared 
to the available experimental data [4, 5, 8] and to the simulation results by Gutowski 
et al. [8] in Fig. 6.

Fig. 5 Snapshot of Hydrazine (top, left), Monomethylhydrazine (top, right) and Dimethylhydrazine

Fig. 6 Saturated densities of Hydrazine (•), Monomethylhydrazine (▲) and Dimethylhydrazine
(■): (striped symbols) experimental critical data [4, 5, 8, 14]; (+) experimental saturated liquid
densities [4, 5]; (full symbols) simulation data on the basis of the present models; (empty symbols)
simulation data (saturated liquid only) by Gutowski et al. [8]; (-) correlation of experimental
data [5]; (- - -) correlation of present simulation data [9]. The statistical uncertainties of the
present data are within symbol size


3.3 Binary Vapor-Liquid Equilibria 

Based on the discussed three molecular hydrazine models, VLE data were predicted 
for all three binary Hydrazine mixtures with Water as well as for the mixture 
Dimethylhydrazine + Hydrazine. 

3.3.1 Water + Hydrazine 

Figure 7 shows the isobaric VLE of Water + Hydrazine at 0.1013 MPa from
experiment [4, 13, 19], simulation and Peng-Robinson EOS. The mixture is
azeotropic, having a temperature maximum. The azeotropic point is at $x_{H_2O} \approx
0.41$ mol/mol. The experimental vapor pressure by Uchida et al. [19] at 388.25 K and
$x_{H_2O} = 0.6925$ mol/mol was taken to adjust the binary parameter of the molecular
model ($\xi = 1.3$). Considering the substantial experimental uncertainties, both data
sets agree very favorably.

Fig. 7 Isobaric vapor-liquid phase diagram of Water + Hydrazine at 0.1013 MPa:
(x) experimental data by Lobry de Bruyn and Dito [13]; (+) experimental
data by Uchida et al. [19]; (•) present simulation data with $\xi = 1.3$; (-)
Peng-Robinson EOS with $k_{ij} = -0.1325$

Fig. 8 Isobaric vapor-liquid phase diagram of Monomethylhydrazine + Water at 0.1013 MPa:
(x) experimental data by Ferriol et al. [6]; (+) experimental data by Cohen-Adad
et al. [2]; (•) present simulation data with $\xi = 1.3$; (-) Peng-Robinson EOS
with $k_{ij} = -0.197$


3.3.2 Monomethylhydrazine + Water 

Figure 8 depicts the VLE of Monomethylhydrazine + Water at ambient pressure.
Like aqueous Hydrazine, this mixture is azeotropic, having a temperature maximum.
In this case, the azeotropic point lies at $x_{CH_3-N_2H_3} \approx 0.25$ mol/mol. The
experimental data by Ferriol et al. [6] at 372.55 K and $x_{CH_3-N_2H_3} = 0.476$ mol/mol were
taken to adjust the binary parameter of the molecular model ($\xi = 1.3$). In the
Water-rich region, to the left of the azeotropic point in Fig. 8, VLE simulations were not
feasible because of sampling problems. Considering the substantial experimental
uncertainties, the data sets from all three approaches agree very favorably.

Fig. 9 Isobaric vapor-liquid phase diagram of Dimethylhydrazine + Water
at 0.1013 MPa: (x) experimental data by Carleton [3]; (+) experimental
data by Ferriol et al. [6]; (•) present simulation data with $\xi = 1.3$; (-)
Peng-Robinson EOS with $k_{ij} = -0.285$

Fig. 10 Isobaric vapor-liquid phase diagram of Dimethylhydrazine + Hydrazine at 0.1013 MPa:
(+) experimental data by Pannetier and Mignotte [16]; (•) present simulation data
with $\xi = 1.01$; (-) Peng-Robinson EOS with $k_{ij} = -0.1$


3.3.3 Dimethylhydrazine + Water 

Figure 9 shows the isobaric VLE of Dimethylhydrazine + Water at 0.1013 MPa
from experiment, simulation and the Peng-Robinson equation of state. In contrast
to the previous two binary systems, this mixture is zeotropic. The experimental
vapor pressure by Ferriol et al. [6] at 345.17 K and $x_{(CH_3)_2-N_2H_2} = 0.571$ mol/mol
was taken to adjust the binary parameter of the molecular model ($\xi = 1.3$). It can be
seen in Fig. 9 that the results obtained by molecular simulation agree well with
the experimental results on the saturated liquid line, but overestimate the
Dimethylhydrazine content on the saturated vapor line for intermediate compositions.


3.3.4 Dimethylhydrazine + Hydrazine 

The VLE of Dimethylhydrazine + Hydrazine is presented in Fig. 10 at ambient
pressure. This system is zeotropic. The experimental vapor pressure by Pannetier
and Mignotte [16] at 346.35 K and $x_{(CH_3)_2-N_2H_2} = 0.4717$ mol/mol was taken to
adjust the binary parameter of the molecular model ($\xi = 1.01$). It can be seen
that the simulation results match the experimental data on the
saturated liquid line almost perfectly. On the saturated vapor line, the experimental data and the
simulation results exhibit some scatter, but the agreement is reasonable.


4 Conclusion 

It was shown that molecular models that were adjusted to VLE data provide the 
option to simulate LLE and predict their pressure dependence. The simulations yield 
reliable data that are in good agreement with experimental values. In contrast to the 
classical approach to provide phase equilibrium data, this route has the capability of 
predicting VLE and LLE data with a single model and set of parameters. 

Furthermore, molecular modeling and simulation was applied to predict the
VLE phase behavior of pure fluids and binary mixtures containing Hydrazine
and two of its derivatives. New molecular models were developed for Hydrazine,
Monomethylhydrazine and Dimethylhydrazine, partly based on quantum chemical
information on molecular geometry and electrostatics. Experimental data on the 
saturated liquid density and the vapor pressure were taken into account to optimize 
the pure substance models. These pure substance properties were represented 
accurately from the triple point to the critical point. 

For an optimized description of the binary VLE, the unlike dispersive interaction
was adjusted for all studied binary systems to a single experimental vapor pressure
of the mixture in the vicinity of ambient conditions. With these binary mixture
models, VLE data were predicted over a range of temperatures and compositions. The
predictions show a good agreement with experimental binary VLE data that were not 
considered in the model development. 
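
The adjustment of the unlike dispersive interaction described above can be illustrated with a short sketch. It assumes the commonly used modified Lorentz-Berthelot combining rule for Lennard-Jones sites, in which the binary parameter ξ scales the unlike energy parameter; this section does not restate the rule, so the functional form, the function name and the numerical site parameters below are illustrative assumptions, not values from the paper.

```python
import math


def unlike_lj_parameters(sigma_a, eps_a, sigma_b, eps_b, xi=1.0):
    """Return (sigma_ab, eps_ab) for the unlike Lennard-Jones interaction.

    Assumed form (modified Lorentz-Berthelot):
        sigma_ab = (sigma_a + sigma_b) / 2
        eps_ab   = xi * sqrt(eps_a * eps_b)
    xi = 1 recovers the plain Lorentz-Berthelot rule.
    """
    sigma_ab = 0.5 * (sigma_a + sigma_b)      # Lorentz rule: arithmetic mean
    eps_ab = xi * math.sqrt(eps_a * eps_b)    # Berthelot rule scaled by xi
    return sigma_ab, eps_ab


# Hypothetical site parameters (size in Angstrom, energy in K), not from the paper.
sigma_ab, eps_ab = unlike_lj_parameters(3.0, 100.0, 3.4, 144.0, xi=1.3)
print(sigma_ab, eps_ab)
```

In such a scheme a single state-independent ξ per binary system is the only mixture parameter, which matches the statement that one experimental vapor pressure was sufficient for the adjustment.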

In this work, molecular modeling and simulation was used to predict the 
VLE phase behavior and the thermodynamic properties of pure hydrazines and 
binary aqueous hydrazine mixtures, for which experimental data were available 
for comparison. The presented molecular models reproduced well
the experimental data that were not considered in the model development. Thus,
these new models could also be valuable for the prediction of properties under
different conditions and for systems where no experimental data are available.
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A Particle-in-Cell Method to Model the 
Influence of Partial Melt on Mantle Convection 


Ana-Catalina Plesa, Doris Breuer, and Tilman Spohn 


Abstract Solid-state convection is the principal mechanism that controls the global 
dynamics and thermal evolution of the terrestrial planets. Observations such as
seismic data and mission data on geological structures at the planetary surfaces
offer important constraints for the interior dynamics. However, our main knowledge
stems from laboratory experiments and in particular from computer models. In
recent years, owing to significant improvements in high performance computing, computer
simulations have become the most powerful approach to this fluid-dynamical problem
by solving partial differential equations in a discrete formulation to describe the
flow in space and time. In the present work we present results obtained using
the cylindrical/spherical code GAIA with a particle-in-cell method to account for
compositional changes due to partial melting of the mantle.


1 Introduction 

Thermal and compositional buoyancy drive the slow creeping flow of rocky planetary
mantles that is ultimately responsible for the heat transport efficiency, magnetic
field generation and the formation of surface geological structures such as volcanoes 
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and tectonic plates. Besides thermal convection, which is driven by secular cooling, 
by the heat from the metallic core and the internal heat due to the decay of 
radioactive isotopes, heterogeneities due to different chemical compositions can 
strongly affect the planform of mantle convection and thus the heat transport and 
planetary evolution. Compositional heterogeneities in planetary mantles can occur 
on all spatial scales. On Earth, seismic data suggest that the region historically
known as D", occupying the lowermost few 100 km of the mantle atop the core-mantle
boundary, is chemically distinct. On Mars, the so-called SNC meteorites
likely originate from separate chemical reservoirs that have been preserved for the
entire evolution of the planet. These meteorites are samples of basalts and basaltic
cumulates which hold a record of volcanism on Mars over most of its history, from
accretion and differentiation to recent times [21]. Geochemical analysis of the SNC
meteorites implies the presence of at least three isotopically distinct reservoirs, two
of them being depleted in lithophile elements and therefore most likely situated in
the mantle and a third one enriched and assumed to lie in the crust (e.g. [5,18]).

Until now, for computational reasons, most geodynamic models have been
highly simplified and use only thermal buoyancy to compute the slow creep flow
inside the planetary mantles, not taking into account partial melting and its
associated effects. However, to investigate the above-mentioned features, compositional
buoyancy due to partial melting is one of the most important ingredients.

In the present project, we present a particle-in-cell method which has been
adapted to investigate the influence of partial melt on the global dynamics of
planetary mantles in numerical simulations with the 2D cylindrical and 3D
spherical convection code GAIA [12,13]. The model was applied in particular to
Mars using constraints inferred from the analysis of the SNC meteorites [21].


2 Mantle Convection Model 

The non-linear nature of convection enables analytical solutions to be found only 
for strongly simplified scenarios. In addition, laboratory experiments cover only a 
limited range of parameters, not always relevant for planetary evolution. Over the 
years, due to the massive increase in computational power, numerical simulations
have grown to be one of the most powerful tools for solving fluid dynamical problems.
Today, parallel computing has become established as a standard procedure for solving
the partial differential equations describing the temporal evolution of the mantle
flow with a complex rheology in different geometries. To model thermal mantle
convection, the conservation equations of mass, momentum, energy and composition
are solved numerically [26]. These equations are scaled with the mantle
thickness D as a length scale and with the thermal diffusion time D^2/κ (κ being
the thermal diffusivity) as a time scale. Their
nondimensional formulation assuming an incompressible fluid in a Boussinesq
approximation with a Newtonian rheology and an infinite Prandtl number is [7]:
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dT 
where u is the velocity field, η is the viscosity, T is the temperature, C is the
chemical component, e_r is the unit vector in radial direction, p is the pressure,
T_surf is the surface temperature, ΔS is the entropy change upon melting, c_p is the
heat capacity at constant pressure, k is the thermal conductivity, t is the time, F
is the melt fraction and Γ(F) is a function which describes the changes of the
chemical component depending on the melt fraction.

In Eqs. (2) and (3), Ra is the thermal Rayleigh number and Ra_Q is the
Rayleigh number for internal heat sources. Both Ra and Ra_Q are related to thermal
buoyancy. Ra_C is the compositional Rayleigh number which accounts for the
buoyancy due to chemical heterogeneities. These Rayleigh numbers are defined as
follows [1,4]:


$$ Ra = \frac{\rho g \alpha \Delta T D^3}{\kappa \eta} \qquad (5) $$

$$ Ra_Q = \frac{\rho^2 g \alpha Q_m D^5}{\kappa k \eta} \qquad (6) $$

$$ Ra_C = \frac{\Delta\rho\, g D^3}{\kappa \eta} \qquad (7) $$

with ρ the mantle density, g the gravitational acceleration, α the thermal expansivity,
ΔT the temperature difference between surface and core-mantle boundary, D the mantle
thickness, κ the thermal diffusivity, η the reference viscosity, Q_m the mantle
radioactive heat sources, k the thermal conductivity and Δρ the density difference
upon mantle differentiation.
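
As a minimal sketch of how these three numbers follow from the dimensional parameters, the function below evaluates them directly. It assumes the standard forms Ra = ρgαΔT D^3/(κη), Ra_Q = ρ^2 gαQ_m D^5/(κkη) and Ra_C = Δρ g D^3/(κη); the function and argument names are illustrative and not taken from GAIA.

```python
def rayleigh_numbers(rho, g, alpha, delta_T, D, kappa, eta, Q_m, k, delta_rho):
    """Return (Ra, Ra_Q, Ra_C) from dimensional mantle parameters.

    All arguments are assumed to be in consistent SI units; Q_m is the
    heat production per unit mass.
    """
    ra = rho * g * alpha * delta_T * D**3 / (kappa * eta)        # thermal
    ra_q = rho**2 * g * alpha * Q_m * D**5 / (kappa * k * eta)   # internal heating
    ra_c = delta_rho * g * D**3 / (kappa * eta)                  # compositional
    return ra, ra_q, ra_c


# Simple round numbers chosen only to make the arithmetic easy to follow.
print(rayleigh_numbers(rho=2.0, g=3.0, alpha=0.5, delta_T=4.0, D=2.0,
                       kappa=1.0, eta=2.0, Q_m=1.0, k=4.0, delta_rho=1.0))
```

The ratio Ra_C/Ra = Δρ/(ραΔT) then measures how strongly compositional density differences compete with thermal buoyancy.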

The viscosity is calculated according to the Arrhenius law for diffusion creep 
[15]. The non-dimensional formulation of the Arrhenius viscosity law for
temperature- and depth-dependent viscosity [25] is given by:

$$ \eta(r,T) = \exp\left( \frac{E + (r_p - r)V}{T + T_{surf}} - \frac{E + (r_p - r_{ref})V}{T_{ref} + T_{surf}} \right) \qquad (8) $$

where E is the activation energy, T_surf is the surface temperature, V is the activation
volume, r is the radius, and r_p is the planet radius. T_ref and r_ref are reference values
for temperature and radius.
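
A short sketch of this viscosity law may help to see its normalization: by construction the non-dimensional viscosity equals 1 at the reference temperature and radius, and it increases towards lower temperature and greater depth. The parameter values below are hypothetical non-dimensional numbers chosen for illustration only.

```python
import math


def arrhenius_viscosity(r, T, E, V, r_p, T_surf, T_ref, r_ref):
    """Non-dimensional Arrhenius viscosity, normalized to 1 at (r_ref, T_ref)."""
    local = (E + (r_p - r) * V) / (T + T_surf)
    reference = (E + (r_p - r_ref) * V) / (T_ref + T_surf)
    return math.exp(local - reference)


# Hypothetical non-dimensional parameters (not taken from the paper).
params = dict(E=10.0, V=2.0, r_p=2.0, T_surf=0.1, T_ref=0.5, r_ref=1.5)

print(arrhenius_viscosity(1.5, 0.5, **params))   # reference state
print(arrhenius_viscosity(1.5, 0.2, **params))   # colder material, more viscous
print(arrhenius_viscosity(1.2, 0.5, **params))   # deeper material, more viscous
```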

In our simulations we consider a one-plate planet with cooling boundary
conditions and decaying radioactive elements. Radioactive elements, namely the
uranium isotopes ²³⁵U and ²³⁸U, the thorium isotope ²³²Th, and the potassium
isotope ⁴⁰K, are considered [1]:


$$ Q_m = 0.9928\, C_0^{U} H^{U238} \exp\left(-\frac{t \ln 2}{\tau_{1/2}^{U238}}\right) + 0.0071\, C_0^{U} H^{U235} \exp\left(-\frac{t \ln 2}{\tau_{1/2}^{U235}}\right) + \dots \qquad (9) $$

where H^i are the rates of heat production, τ^i_{1/2} the half-lives of the isotopes, C_0^i the
initial concentrations and t the time.

During the melting process, the extraction of melt and formation of crust redistributes
the radioactive heat producing elements (HPEs). Being highly incompatible,
HPEs are enriched in the melt and by extracting the melt to form crust, the HPEs 
become depleted in the residual mantle. The radioactive heat sources are calculated 
using the accumulated fractional melting formula which allows us to obtain their 
concentration in the melt [6,20]: 

$$ X_F = \frac{X_{bulk}}{F}\left[ 1 - (1 - F)^{1/\delta} \right] \qquad (10) $$

The term X_bulk is the initial concentration in the mantle and δ the partitioning
coefficient [16]. A typical value of δ = 0.002 for the radioactive heat sources is
used. This value is obtained by averaging the partitioning coefficients weighted by their
mineral percentage in the mantle. The lower the partitioning coefficient, the faster the
mantle is depleted in the incompatible elements. Assuming accumulated fractional
melting and the partitioning coefficients from [6], nearly complete extraction of
heat-producing elements is obtained for melt fractions greater than 1 %.
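
The statement about nearly complete extraction can be checked with a few lines of arithmetic. The sketch below assumes the accumulated fractional melting formula in the form X_F = (X_bulk/F)[1 - (1 - F)^(1/δ)] and assumes that all of the melt is extracted, so that the extracted share of the element is F·X_F/X_bulk; the symbol and function names are illustrative.

```python
def melt_concentration(F, X_bulk, delta):
    """Concentration of an incompatible element in the accumulated melt."""
    return X_bulk / F * (1.0 - (1.0 - F) ** (1.0 / delta))


def extracted_fraction(F, delta):
    """Share of the element removed from the source if all melt is extracted."""
    return 1.0 - (1.0 - F) ** (1.0 / delta)


delta = 0.002  # partitioning coefficient quoted in the text
for F in (0.001, 0.005, 0.01, 0.05):
    print(F, extracted_fraction(F, delta))
```

With δ = 0.002 the extracted share exceeds 99 % once the melt fraction reaches 1 %, while it is still below 50 % at a melt fraction of 0.1 %.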


3 Technical Realization 

We consider the mantle convection in 2D cylindrical [23] and 3D spherical shells 
using the code GAIA [12,13]. GAIA uses BiCGStab with a Jacobi preconditioner
to solve the resulting linear systems of equations. For the matrix storage a Harwell- 
Boeing sparse matrix class has been chosen. 

The discretization of the governing equations is based on the finite-volume 
method with the advantage of utilizing fully irregular grids in three and two 
dimensions, efficiently parallelized for up to 396 CPUs [12, 13]. The space is 
discretized by a fixed grid, while for the temporal discretization a fully implicit 2nd-order
method, also called an implicit three-level scheme, after [9] has been used.
In contrast to the spatial discretization, the temporal discretization is flexible and can
adapt to the situation with a varying time step Δt. The SIMPLE method proposed by Caretto
et al. [2] and Patankar [22] was adopted to solve the coupling
of the continuity equation with the momentum equation.
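
To illustrate what an implicit three-level scheme with a varying time step looks like, the sketch below applies the second-order backward differentiation formula (BDF2) with variable step size to the scalar test problem dy/dt = -λy. Identifying the scheme of [9] with variable-step BDF2 is an assumption made here for illustration; the coefficients actually used in GAIA are not given in the text, and all names are hypothetical.

```python
import math


def bdf2_coefficients(dt, dt_prev):
    """Coefficients (a0, a1, a2) of variable-step BDF2:
    (a0*y_new + a1*y_n + a2*y_nm1) / dt approximates dy/dt at the new time."""
    w = dt / dt_prev                      # step-size ratio
    return (1 + 2 * w) / (1 + w), -(1 + w), w * w / (1 + w)


def integrate_decay(lam, y0, steps):
    """Integrate dy/dt = -lam*y implicitly over the given list of time steps."""
    y_prev = y0
    y = y0 / (1.0 + lam * steps[0])       # backward Euler start-up step
    for dt_prev, dt in zip(steps, steps[1:]):
        a0, a1, a2 = bdf2_coefficients(dt, dt_prev)
        # Implicit equation a0*y_new + a1*y + a2*y_prev = -dt*lam*y_new
        y_new = -(a1 * y + a2 * y_prev) / (a0 + lam * dt)
        y_prev, y = y, y_new
    return y


# Alternating step sizes mimic an adaptive time step.
steps = [0.01 if i % 2 == 0 else 0.015 for i in range(100)]
y_end = integrate_decay(1.0, 1.0, steps)
print(y_end, math.exp(-sum(steps)))
```

For equal steps the coefficients reduce to the familiar (3/2, -2, 1/2) of the constant-step three-level scheme.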
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The model was validated by a comparison with analytically known solutions 
as well as published numerical results [12, 23]. A comparison with a commercial
product also yielded satisfactory results. A convergence test with successively refined
grids confirmed the convergence of global quantities towards an extrapolated solution
[14]. 

When modelling thermo-chemical convection, one of the most challenging parts 
is the handling of Eq. (4). In [24] two methods for modelling active compositional 
fields have been compared. The grid-based method substitutes Eq. (4) by a partial
differential equation similar to the heat transport equation with a marginal diffusivity:

$$ \frac{\partial C}{\partial t} + u \cdot \nabla C - \frac{1}{Le} \nabla^2 C + \Gamma(F) = 0 \qquad (11) $$

where Le is the Lewis number, which is the ratio of thermal diffusivity to chemical 
diffusivity. Major disadvantages of this method are the non-negligible numerical 
diffusivity and the simultaneous advection of multiple materials with different 
physical properties (e.g. [8]), for which further equations have to be solved [24]. 

A better approach to model Eq. (4) is to use a particle-in-cell method [24]. In this 
case, massless particles are advected in every time step according to the velocity 
field. We have adapted this method to account for multiple physical properties which 
change during the melting process. In our simulations, particles can carry properties 
like density, thermal conductivity, radioactive heat sources, water concentration 
etc. At the beginning of a simulation the particles are distributed within each 
cell of the computational domain. The initial conditions are interpolated from the 
corresponding fields onto the particles. In this way complex initial conditions can 
easily be imposed on the particles. In every time-step of the simulation the particles 
are moved using the velocity field computed from the momentum equation (Eq. (2)). 
The next position of a particle is then computed by solving a trajectory equation 
using the Runge-Kutta 4th-order method. After the new position of every particle 
has been determined, the particles' physical values are interpolated back onto the
grid to solve Eqs. (1)-(3) in the next time-step.
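
The advection step described above can be sketched as follows. The example moves a single massless particle with the classical 4th-order Runge-Kutta method in a prescribed 2D velocity field; the analytic solid-body rotation stands in for the velocity that GAIA would interpolate from the grid, and the velocity is assumed to be frozen during one time step. All names are illustrative.

```python
import math


def rk4_step(pos, velocity, dt):
    """Advance a particle position by one RK4 step in a steady velocity field."""
    x, y = pos
    k1 = velocity(x, y)
    k2 = velocity(x + 0.5 * dt * k1[0], y + 0.5 * dt * k1[1])
    k3 = velocity(x + 0.5 * dt * k2[0], y + 0.5 * dt * k2[1])
    k4 = velocity(x + dt * k3[0], y + dt * k3[1])
    return (x + dt / 6.0 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            y + dt / 6.0 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))


def solid_body_rotation(x, y):
    """Test velocity field: rotation about the origin with unit angular speed."""
    return (-y, x)


def advect(pos, velocity, dt, n_steps):
    """Apply n_steps RK4 steps and return the final position."""
    for _ in range(n_steps):
        pos = rk4_step(pos, velocity, dt)
    return pos


# One full revolution should bring the particle back to its starting point.
n_steps = 100
end = advect((1.0, 0.0), solid_body_rotation, 2 * math.pi / n_steps, n_steps)
print(end)
```

Because the particle carries its properties along its trajectory, no diffusion term is introduced by the transport itself; errors enter only through the trajectory integration and the particle-to-grid interpolation.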

The drawback of this method compared to the grid-based approach is the increase
in both memory and computational time (see Fig. 1). This is to be expected,
keeping in mind that 15-20 particles per cell are needed in order to maintain the
accuracy when interpolating back from the particles to the field [3]. The major
advantages are the absence of numerical diffusion and the fact that multiple different
physical properties can be handled simultaneously.

At the moment, besides GAIA, only a small number of numerical codes worldwide
[17,28] can handle 3D spherical geometry using a particle-in-cell method to
account for the active compositional fields.
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Fig. 1 Slowdown factor depending on the number of particles used. A 2D cylindrical test with a 
grid resolution of 1.5 × 10^5 computational points and an increasing number of particles per cell
has been used. The choice of 5 particles per cell results in a total of 7.4 × 10^5 particles in the
whole computational domain, while 30 per cell correspond to a total of 4.4 × 10^6 particles. The
benchmark has been performed using 96 computational cores at the SCC Karlsruhe on the XC4000
supercomputer


4 Results and Discussion 

First we present a comparison between the grid-based and particle-in-cell methods 
to illustrate the advantages of the latter when modelling active compositional fields. 
For this we consider a test where a dense ball of composition 1 sinks into a less dense 
medium which has the composition 0. Both the ball and the surrounding medium 
have the same viscosity. During the movement of the ball, its shape will deform. For 
this test we impose free-slip boundary conditions for the velocity and turn off the 
solving of the energy equation since we are only interested in the buoyancy caused 
by the compositional heterogeneities. 

The compositional Rayleigh number Rac is set to 1. The tests were performed 
using both grid-based and particle-in-cell methods on a grid with 1.5 × 10^5 computational
points. In the particle-based case we use 30 particles per cell, resulting
in a total of 4.5 × 10^6 particles for the entire computational domain.

Figure 2 shows the superiority of the particle-in-cell method. While deformation 
takes place in both cases (i.e. grid-based method and particle-in-cell-method), 
we clearly observe the pronounced numerical diffusion in the grid-based case. 
Therefore for the rest of the presented results we use the particle-in-cell method 
to compute the changes in different physical properties due to melting. 
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Fig.2 (a) Particle-in-cell case, (b) grid-based method, (c) difference image between (a) and (b). 
In the grid-based case the ball's shape diffuses, whereas in the particle-in-cell case the ball, in spite of
its deformation, remains separate from the surrounding material


Next, we compare two cases including partial melting and its associated effects in
a 2D cylindrical and a 3D spherical geometry. For the 2D cylindrical geometry case
we use a geometry factor to rescale the inner to outer perimeter in order to match
the inner to outer surface ratio in the 3D geometry [11]. Tests show that without
this scaling the heat flow from the core into the mantle is overestimated in the
2D geometry, resulting in about 100 K higher mantle temperatures. This can have a
major influence, leading to an overestimate of the melt fraction and the crustal volume
produced during the thermal evolution.

For the 2D cylindrical geometry we use a grid with 1.5 × 10^5 computational points,
resulting in a 10 km resolution, and a total number of 2.2 × 10^6 particles for the
entire computational domain. In the 3D spherical geometry we use a grid with 2.8 × 10^6
computational points, resulting in a 30 km resolution, and a total number of
2.2 × 10^7 particles for the entire computational domain.

The melting effects considered in these cases are (1) the change of the melting
temperature due to the loss of low-melting point components during the melting
process [19], (2) the decrease of the density of the mantle material due to the mantle
depletion in crustal components [4], (3) the decrease of thermal conductivity in the crust
[27] and (4) the redistribution of radioactive elements [10].

When melting occurs in the mantle, the crustal volume and therefore the 
corresponding crustal thickness are calculated. The physical properties stored on 
the particles are then changed depending on the particle position (crust or mantle).
For the particles which lie in the crust, the thermal conductivity is lowered and the
amount of radioactive elements is increased. The mantle particles become depleted
in radioactive elements, while in this test we do not consider a change in density
between depleted and undepleted mantle material.

Figure 3 shows a good agreement between the 2D cylindrical geometry and 3D
spherical geometry case, keeping in mind the different resolutions applied for the 2D
and 3D case (i.e. 10 km resolution in the 2D case whereas only 30 km resolution has
been used in the 3D case).

Fig. 3 (a) Temperature slice (2D cylindrical geometry) - partial melt regions in white, (b) temperature iso-surface (3D spherical geometry) - partial melt
regions in red, and (c) radial profiles (the dashed lines show the 2D and the full lines the 3D geometry, respectively) at t = 2.5 Ga for a case with 1,600 K
initial mantle temperature and no compositional changes upon mantle differentiation

Fig. 4 Depletion slices and radial profiles translated as density (upper row) and temperature slices
and temperature radial profiles (lower row) at t = 2.5 Ga and t = 4.5 Ga for a case with 1,900 K
initial mantle temperature and compositional changes, which cause a difference in the mantle
density of 60 kg/m^3 upon 30 % mantle differentiation

For Mars, the SNC meteorites suggest that separate reservoirs
formed early in the planet's evolution and have not mixed since. To explain the
formation of such reservoirs, we consider density variations due to the extraction
of crustal components during the melting process. A density difference of 60 kg/m^3
upon 30 % differentiation (from peridotite - fertile mantle material - to residues
like harzburgite) is considered. Higher degrees of depletion, with residue compositions
like dunite (up to 60 % depletion), do not cause any further changes in the residue
density. Therefore the maximum density variations are reached between peridotite
and harzburgite (0-30 % depletion).
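
The density parameterization described in this paragraph can be written as a simple piecewise function. The sketch below assumes that the density decreases linearly with depletion between 0 and 30 % and stays constant for higher depletion; the linear shape between the two end points is an assumption for illustration, since the text only gives the end values, and the function name is hypothetical.

```python
def density_change(depletion, max_change=-60.0, saturation=0.30):
    """Density change of the residue in kg/m^3 as a function of depletion (0..1).

    Assumed shape: linear decrease up to `saturation`, constant afterwards.
    """
    d = min(max(depletion, 0.0), saturation)   # clamp to the active range
    return max_change * d / saturation


for depletion in (0.0, 0.15, 0.30, 0.60):
    print(depletion, density_change(depletion))
```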

The test in Fig. 4 was performed for a dry mantle rheology with a
reference viscosity of 10^21 Pa s, assuming an initially depleted layer and neglecting
the effects of partial melting on the thermal conductivity and radioactive heat
source redistribution. Figure 4 shows density variations resulting in mantle inhomogeneities
which form a depleted layer that prevents the lower mantle from efficient
cooling. In this case hot material from the lower mantle rises to shallower depths 
where it can melt. The heterogeneities produced due to mantle depletion in this case 
remain stable during the entire thermal evolution and could explain the isotopic 
characteristics of the SNC meteorites. 
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5 Conclusions and Outlook 

In this work we have presented results using 2D cylindrical and 3D spherical 
convection models which account for effects of partial melt during the thermal 
evolution of terrestrial planets. We have adapted a particle-in-cell method to account 
for the changes in physical properties due to melting and have applied our model in 
particular to Mars. 

The results presented in Sect. 4 show (1) the superiority of the particle-in-cell 
method compared to a grid-based approach, (2) a good agreement of 2D cylindrical
and 3D spherical geometry results when using a scaling of the inner to outer surfaces
in 2D to match the surface ratio in 3D geometry and (3) the formation of chemical
reservoirs which remain stable during the entire thermal evolution, as inferred from
the SNC meteorites for Mars.

Over the past 2 years we have improved our mantle convection code to account 
for partial melting effects by using a particle-in-cell method in both 2D and 3D 
geometry. Satisfactory results have been achieved when applying our methods to
Mars-like parameters. In a follow-on project, we plan to investigate the effects of 
partial melting on terrestrial exo-planets. In the last several years, space missions 
like CoRoT and Kepler have provided us with promising candidates for terrestrial 
exo-planets (i.e. terrestrial planets outside the Solar System with masses less than 
10 Earth masses and/or radii around or below 2 Earth radii). 

Due to the measurement errors in mass and radius, the interior structure of an
exo-planet and hence the mantle thickness is poorly constrained. This can have a
major effect on the partial melt production in the mantle, since for a thin mantle the
planet is expected to cool faster and hence less melt is produced during
the planet's thermal evolution. Therefore, we plan to apply our model to investigate
how the mantle thickness influences the thermal evolution and the partial melting in
the mantle. For this we plan tests with (1) Mercury-like, i.e. thin mantle, (2) Earth-like
and (3) Moon-like interior structure, i.e. thick mantle.

Due to the detection methods, most of the discovered exoplanets are close to
their stars and therefore tidally locked (i.e. during its revolution around the star,
the planet always faces the star with the same side). In fact, this is also suggested for
the two terrestrial exoplanets CoRoT 7b and Kepler 10b. We will use our model
to investigate how the surface temperature variations influence the convection 
structure and the partial melting in the mantle. For this study tests with (1) uniform 
surface temperature, (2) day-night scenario and (3) cold poles and warm equator 
will be performed. 

Both mantle thickness and surface temperature variations can have a major effect 
on the partial melt production and therefore on the entire thermal evolution of a 
terrestrial exo-planet. Therefore we plan to apply the GAIA mantle convection code 
to investigate the thermal evolution of terrestrial exo-planets accounting for partial
melting, varying mantle thickness and surface temperature variations.
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A Forward Model of Mantle Convection with 
Evolving Continents and a Model of the Andean 
Subduction Orogen 


Uwe Walzer, Roland Hendel, Christoph Köstler, Markus Müller, Jonas Kley,
and Lothar Viereck-Götte


Abstract Some essential features of Andean orogenesis cannot be explained only 
by a dynamic regional model since there are essential influences across its vertical 
boundaries. A dynamic regional model of the Andes should be embedded in a 
3-D spherical-shell model. Because of the energy distribution on the poloidal 
and toroidal parts of the creep velocity and because of geologically determined 
mass transport alongside the Andes, both models have to be three-dimensional. 
Furthermore, we developed a new viscosity profile of the mantle with very steep 
gradients at the lithospheric-asthenospheric boundary and at a depth of 410 and 
660 km. Therefore, the demands on the code Terra are now considerably larger. In
the last 3 years we have resolved these problems in an international cooperation 
(see Sect. 2.2). Based on the new viscosity profile and on an improved Terra, we 
computed a new forward spherical-shell model (Walzer and Hendel, J Geophys 
Res submitted, 2012b). For this model, we also derived a new extended acoustic
Grüneisen parameter, γ_ax, new profiles of the thermal expansivity, α, and of the specific
heat at constant volume, c_v, as well as a solidus depending on both the pressure
and the water abundance. These innovations are essential to incorporate a
chemical-differentiation mechanism into the model. We arrived at rather realistic episodes of
continental growth interrupted by magmatically quiet time spans distributed over the
whole time axis. Nevertheless, the model shows a main magmatic event at the very
beginning of the Earth's evolution. Papers on the improvement of Terra (Köstler
et al. Comput Geosci submitted, 2012; Müller and Köstler, Int J Numer Methods
Eng submitted, 2012) have been written. We conceived a regional model of the
Andean orogenesis (Sect. 3.2.1) with the same new viscosity profile. We want to
investigate why there is flat-slab subduction in some segments of the Andes and
why deformation of the crust and volcanism migrate eastward. The evolution of
the abundances of incompatible elements indicates a cycle which was finished by a
fast process, perhaps by a large-scale delamination of the lower plate, perhaps also
by another type of delamination. In connection with another spherical-shell model 
(with prescribed plate boundaries), the regional model should numerically explain 
why a plateau-type orogen evolved at an oceanic-continental plate boundary. 


1 Introduction 

The papers which refer to our topic can be classified by seven subjects. 

(a) Geological description of dynamic problems of thermal evolution of the Earth, 
plate tectonics and chemical differentiation as well as of the Andean orogeny, 

(b) Models which are partially kinematic and partially dynamic. In this kind of
model, essential features are prescribed in order to achieve a close fit to
geological and geophysical observations.

(c) Geochemical models of growth and differentiation of continents which do not 
contain any dynamic modeling, 

(d) Dynamic models of the subduction process to understand the physical mecha¬ 
nism behind subduction, 

(e) Circulation models, 

(f) Fully dynamic models of subduction in a diamond (cf. Fig. 8), i.e. in a certain 
3-D sector of the spherical shell, which represents the mantle, where this 
diamond is embedded into a realistic 3-D spherical-shell solution, 

(g) Fully dynamic forward models of spherical-shell convection with chemical 
differentiation and generation of continents. 

We systematically described the papers of types (a)-(e) in [110]. Therefore, we 
will not repeat it here. Up to now, there is no paper of type (f). Some supplements 
will follow: To obtain a more profound analysis of the Andean orogeny, it is 
important to understand why a plateau-type orogen formed between a purely 
oceanic lithospheric plate (Nazca plate) and a continent (South America) [71]. 
The Andean mountain belt belongs to the non-collisional type. Kley and Monaldi 
[42,45,46] found a Cenozoic shortening of 250-350 km whereas Arriagada et al. [4]
derived 400 km for the central Andes. In other areas of the Earth, however, the
subduction zones at an oceanic-continental plate-boundary site have only little or
no shortening and do not show any elevated plateaus. In most cases, the upper plate
is characterized by backarc extension. Schellart and Rawlinson [82] discuss some
hypotheses that have been claimed to explain this exceptional behavior of the
Andes.

• In [66], it is proposed that the young age of the Nazca plate and low negative 
buoyancy cause this phenomenon. 
• Climatic conditions, high friction and subduction erosion are thought to be the
decisive factors [57,58].

• Heuret and Lallemand [30] emphasize the eminent role of the acceleration of the
westward movement of the upper, South American plate. Sobolev et al. [84] find
that the accelerating westward movement of the South American plate is the
most important factor for the Andean orogeny.

The third hypothesis raises the question of the mechanism which drives South
America westward. The ridge push at the mid-Atlantic ridge and the slab pull at
the Lesser Antilles and the Scotia arc have been proposed but it could be that these
contributions are too small. Schellart and Rawlinson [82] notably mention the
slab pull of the Nazca plate. However, we propose that a downwelling of the bulk 
convection beneath South America caused by the thermal screening of the thick 
continental lithosphere could play a role. If there is a large upwelling east of South 
America and a large downwelling of the bulk convection under South America we 
could understand why the bulk convection current would move the South American 
plate westward since, because of the thick continental lithosphere, South America 
is not decoupled from the bulk convection by the asthenosphere. Additionally, we 
observe a large upwelling of the bulk convection beneath the Pacific. So we can 
expect that another current of the bulk convection will go eastward, producing the 
arcs of the Lesser Antilles and Scotia. It is unclear if these speculative arguments 
are appropriate. However, they show not only the necessity of dynamic, numerical
regional models but also that the regional model must be embedded in time-dependent
boundary conditions determined by a global dynamic model. Evidently,
some essential features of the Andean orogenesis cannot be explained by an isolated 
regional model. 

We discuss further new geological and geophysical papers in Sect. 3.2.1. We 
did that on purpose in order to substantiate why we used certain details in our 
development of two special (alternative) mechanisms for a regional model of the 
South American subduction zone and Andean orogenesis. Further new papers 
are mentioned in Sect. 3.2.2 in order to select an appropriate set of prescribed plate 
movements in the surrounding spherical-shell convection model which is necessary 
to embed the regional model. 


2 A Spherical-Shell Forward Model and Other Results 

Under Sects. 2.1-2.3, we describe what we have done using the system HP XC4000
of the Steinbuch Center for Computing in Karlsruhe. Under Sect. 2.4, other efforts of
ours which have some relation to the topic are reported. There is a direct relationship
between Sects. 2.1-2.3 and the ongoing work which we describe in Sect. 3.2.1
in rather distinct outlines, i.e. we discuss our specific South American dynamical
model which is embedded in a second spherical-shell model.

2.1 Spherical-Shell Model: Forward Model 

We developed two spherical-shell convection models. The forward model is 
described here; another model with prescribed plate movements is outlined 
under Sect. 3.2.2. For both spherical-shell models as well as for the regional models 
of the Andes (Sect. 3.2.1), we newly derived, or in some cases adopted, new radial profiles 
of the relevant physical parameters of the mantle [101]. As a Grüneisen parameter, 
we calculated an extended acoustic gamma, γ_ax, using seismic observations. 
The observed quantities are the bulk modulus K, the shear modulus μ, dK/dP and 
dμ/dP, where P is the pressure. This procedure has the advantage of being based on 
observable quantities without any assumptions on mineralogy. A further calculated 
property is the adiabatic gradient. We derived new profiles for the thermal expansivity, 
α, and the specific heat at constant volume, c_v. For the chemical differentiation 
we used a simple melting criterion, T > f_3 T_m, where T is the temperature and T_m is 
the solidus. We estimated the lower-mantle solidus using values of [60] and of other 
authors, continuing them by means of our γ_ax. In the upper mantle and the transition 
layer, we took into consideration the water dependence of the solidus. Litasov [61] 
investigated the influence of different water concentrations on the solidus of peridotite, 
and we took the derived depressions of the solidi into account in the computation 
of our convection-differentiation model of the mantle’s evolution. So in our new 
model, the melting temperature is a function of time and position. The most important 
innovation of the new convection model concerns our newly derived viscosity 
profile which is based on solid-state physics and seismological results [100,101]. 
This viscosity distribution has a resemblance to the viscosity model of [65] although 
its derivation is totally different. The full set of convection-differentiation equations 
has been solved using the improved code Terra (see Sect. 2.2). For each run, we 
obtained many output quantities, some of which can be compared with observational 
quantities. For ages from 4,490 Ma to the present time, we obtained the curves of the 
laterally averaged heat flow density qob at the surface, the converted continent-tracer 
mass per Ma, the Urey number Ur, the Rayleigh number Ra, the Nusselt number Nu, 
the volumetrically averaged mean temperature Tmean of the mantle, showing a very 
realistic temperature drop when compared with Archean komatiite temperatures, the 
integrated mass of continents, the kinetic energy of the mantle flow Ekin, the radiogenic 
heat production Qbar and the laterally averaged heat flow density qcmb at the 
core-mantle boundary (CMB). Further results are the vector field of the creeping 
velocity and the temperature distribution for every time step of the Earth’s evolution. 
Up to now, we varied the melting parameter f_3 and the thermal conductivity k. 
The realistic sets of solutions form clusters in the f_3-k plane. Figure 1 represents an 
episodic distribution of juvenile additions to the continents which is connected with 
the episodicity of orogenetic epochs. Figure 2 shows the present-day distribution of 
continents of the same run. In this example, 42.2 % of the Earth’s surface is covered 
by continents. The present-day surface heat flow of this run is 81.39 mW/m², 
comparable with the observed value of 90.185 mW/m² [12]. The present-day value 
of qcmb of this run is 20.20 mW/m², which seems to be realistic, too. 
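
The melting criterion T > f_3 T_m used for the chemical differentiation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the way a water-induced solidus depression is subtracted from a dry solidus, and all numerical values are hypothetical and are not taken from the model or from Terra.

```python
def exceeds_melting_criterion(T, T_m_dry, f3, water_depression=0.0):
    """Return True if the temperature T satisfies T > f3 * T_m.

    T_m is modelled here (as an assumption) as a dry solidus lowered
    by a water-dependent depression; all temperatures are in kelvin.
    """
    T_m = T_m_dry - water_depression  # hypothetical treatment of water
    return T > f3 * T_m


# Hypothetical values: a slightly wet solidus melts at a lower temperature.
print(exceeds_melting_criterion(1945.0, 2000.0, 0.995))
print(exceeds_melting_criterion(1945.0, 2000.0, 0.995, water_depression=50.0))
```

Because the solidus depends on the local water abundance, the same temperature may or may not satisfy the criterion at different times and positions, which is the point made in the text above.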

Fig. 1 Juvenile additions to the sum of continental masses according to the new 
convection-differentiation model, Run 498 [101] 



Fig. 2 The distribution of continents (red), oceanic lithosphere (yellow) and oceanic plateaus 
(black dots) for the present time according to the new convection-differentiation model [101], Run 
498, r_n = 0.5, σ_y = 120 MPa, continental percentage = 42.2 % 


We computed the shear viscosity, η, using 

$$
\eta(r,\theta,\phi,t) = 10^{r_n} \cdot \frac{\exp(c\,T_m/T_{av})}{\exp(c\,T_m/T_{st})} \cdot \eta_4(r) \cdot \exp\left[ c_t \cdot T_m \left( \frac{1}{T} - \frac{1}{T_{av}} \right) \right] \tag{1}
$$


where r is the distance from the Earth’s mass center, θ the colatitude, φ the 
longitude, t the time and r_n the viscosity-level parameter. The quantity r_n has been 
varied to shift the viscosity profile to the left or to the right. In this way we generated 
different time-averaged Rayleigh numbers, Ra, varying from run to run. T_m is 
the newly derived melting temperature [101] which additionally depends on 
water abundance. So T_m is a function of time, too. T_av is the laterally averaged 
temperature, T_st the initial temperature profile. The quantity η_4 denotes the new 
viscosity profile [101] for the initial temperature and r_n = 0. For MgSiO_3 perovskite 
we should insert c = 14, for MgO wüstite c = 10 according to [113]. Therefore, 
the lower-mantle value of c should be somewhere between these two values. For 
numerical reasons, however, we are able to use only c = 7. In the lateral-variability 
term we use c_t = 1. The temperature is denoted by T. 
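
As a rough illustration of how the factors of Eq. (1) combine, the following Python sketch evaluates the viscosity at a single point. It is not taken from the Terra code: the function name and argument names are hypothetical, and the sketch assumes that the lateral-variability term has the form exp[c_t T_m (1/T - 1/T_av)], consistent with the symbols defined above. The two exponentials of the radial factor are merged into one to avoid overflow for large c T_m / T.

```python
import math


def shear_viscosity(eta4, T, T_av, T_st, T_m, r_n=0.5, c=7.0, c_t=1.0):
    """Pointwise viscosity in the spirit of Eq. (1) (illustrative only).

    eta4 : value of the radial viscosity profile at this radius
    T    : local temperature, T_av : laterally averaged temperature,
    T_st : initial temperature, T_m : melting temperature (all in K)
    """
    level = 10.0 ** r_n  # viscosity-level shift
    # exp(c*T_m/T_av) / exp(c*T_m/T_st), written as a single exponential
    radial = math.exp(c * T_m * (1.0 / T_av - 1.0 / T_st))
    # assumed form of the lateral-variability term
    lateral = math.exp(c_t * T_m * (1.0 / T - 1.0 / T_av))
    return level * radial * eta4 * lateral


# Hypothetical numbers: a point colder than the lateral average is stiffer.
cold = shear_viscosity(1e21, T=1500.0, T_av=1600.0, T_st=1600.0, T_m=2000.0)
hot = shear_viscosity(1e21, T=1700.0, T_av=1600.0, T_st=1600.0, T_m=2000.0)
print(cold > hot)
```

With T = T_av = T_st and r_n = 0 the function returns eta4 unchanged, which matches the statement that η_4 is the profile for the initial temperature and r_n = 0.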

Fig. 3 We compare different runs which are distinguished only by the viscoplastic yield stress, 
σ_y, and the viscosity-level parameter r_n. The symbols represent the magnitude of the difference, 
d *, between the computed and observed present-day surface percentage of continents, expressed 
in percent 


Figure 3 shows the difference between computed and observed present-day 
continental surface percentage. The central run with r_n = 0.5 (corresponding to 
Ra ≈ 10^8) and σ_y = 120 MPa has the minimum difference between theory 
and observation, namely 1.85 %. This run also shows other optimal results, e.g. 
regarding the temporal distribution of the magmatic-activity episodes (cf. Fig. 1) 
and the (spectral) distribution of continents (cf. Fig. 2). Further variations of the 
parameters refer to the factor f_3 of the melting criterion and to the thermal 
conductivity, k, keeping r_n and σ_y fixed at the mentioned two values. At first 
glance at Fig. 4, one could have the impression that there is a trade-off between k and 
f_3 since optimal solutions (black circles with an outer ring) cluster along a certain 
curve. But all the other observable quantities in other f_3-k plots show that only the 
three optimum solutions with k = 5.0 W/(m·K) in the upper right corner of Fig. 4 are 
realistic. Figure 5 demonstrates, e.g., that only thermal-conductivity values around 
k = 5.0 W/(m·K) lead to solutions which are satisfactory for all observables. This 
value is also acceptable from the physical point of view [115]. Therefore, we varied 
f_3 in small steps from 1.000 downward, keeping k = 5.0 W/(m·K). As expected 
from Figs. 4 and 5 and similar f_3-k plots, we obtained very realistic solutions 
down to f_3 = 0.985. However, f_3 = 0.983 and f_3 = 0.981 already generate less 
convincing results. Figure 6, e.g., shows the corresponding present-day distribution 
of continents. 

Fig. 4 Keeping r_n = 0.5 and σ_y = 120 MPa fixed, we vary the melting-criterion factor, f_3, and the 

Fig. 6 Less realistic distributions of continents (red), oceanic lithosphere (yellow) and oceanic 
plateaus (black dots) for the present time. r_n = 0.5, σ_y = 120 MPa, k = 5.0 W/(m·K), f_3 = 0.983 
and f_3 = 0.981, respectively 


2.2 Numerical Improvements 


Two submitted publications of our group [55,69] deal with numerical improvements. 
Their conclusions will not be discussed here. However, their results are very 
important for the realization of our Andean model (Sect. 3.2.1). 

A large part of our effort was devoted to creating a substantially 
improved Terra code, which is necessary to resolve the numerical problems with the 
newly derived viscosity profiles. We apply this new viscosity profile [101] equally 
to a spherical-shell mantle convection model [100,101] and to the regional 
model of the South American subduction slabs (Sect. 3.2.1). In 2009, the Terra 
developers of the universities of Munich (H.-P. Bunge, M. Mohr), Cardiff (H. Davies), 
Leeds (P. Bollada), Jena (M. Muller, C. Kostler) and of Imperial College London 
(R. Davies) and others started a close cooperation in the further development of the 
Terra code and intensified the collaboration with J. Baumgardner (San Diego, USA), 
the inventor of Terra. At a first joint meeting in Munich in 2009, the group decided 
to set up a community svn repository for further code development, supplemented 
by trac, a web-based project management and bug-tracking tool, and automated 
compile/test cycles using BuildBot. From then on the group worked on a common 
code base using automated tests for every revision of the code. There have been 
three subsequent joint meetings in Cardiff in 2010, in Jena in 2011 (see also 
http://www.igw.uni-jena.de/geodyn/terra2011.html) and in Munich in 2012. The progress 
made since the first meeting includes the following items. 

• Enhancement of the code to increase global resolution and maximum number of 
MPI processes. 

• Further development and integration of the Ruby test framework [68] into the 
automated BuildBot tests. 

• Implementation of a finite-element inf-sup stabilization using pressure- 
polynomial projections proposed in [15]. 

• Development and implementation of an efficient preconditioner for the variable- 
viscosity Stokes system [54]. 

• Refinement of the Pressure Correction algorithm [54], giving more robust 
convergence. 

• Restructuring of the code to use language features of Fortran95 and Fortran2003 
where possible. 

• Integration of automated code documentation using doxygen. 

• Integration of VTK-support and automated visualization. 

• Significant improvements in the formulation of the free-slip boundary condition 
on the spherical surface. 

Continued effort is spent on several numerical and technical topics as well as on 
including more realistic physical models. In the following, some important details 
are given. 

• In [97, 100, 101], two pressure- and temperature-dependent viscosity profiles 
of the Earth’s mantle are developed and used for the computation of two 
mantle convection models with chemical differentiation of oceanic plateaus and 
generation of continents. The new viscosity model includes very strong viscosity 
gradients at the lithosphere-asthenosphere boundary and at the 410- and 660-km phase 
boundaries. Therefore, it is very important to make the code Terra fit for such a 
demanding problem. Regarding a physically consistent variable-viscosity momentum 
operator, J. Baumgardner, P. Bollada and C. Kostler worked out how 
the code has to be changed to apply a physically consistent A-operator using 
cell-averaged viscosities. The most significant code change is the switch from 
node-based to triangle-based operator parts on the sphere. The viscosity-weighted 
summation over triangular integrals is then done in the application 
of the operator. We expect that the cost of applying Au will be doubled, but a 
consistent formulation on all grid levels could compensate for this, especially if we 
get a better convergence rate of the multigrid algorithm. The implementation of 
the triangle-based operator formulation is now under way. 

• Exporting Terra-operators in sparse matrix format: Once the 
FE-matrices for the whole grid have been exported, they can be analyzed, and PETSc or other parallel 
solver packages can be applied to them. This is intended to provide flexibility and to 
ensure reliability of future code changes. 

• Multigrid(MG)-implementation: As mentioned in [90], the current MG- 
implementation in Terra does not fulfill the expectations raised by the 
performance of a 2-D Cartesian version of Terra, documented in [114]. In [54] it 
is also identified as the worst-performing part of the iterative solver in Terra. It 
has been analyzed only poorly and is not satisfactorily documented. M. Mohr is going 
to analyze and document the currently used matrix-dependent transfer multigrid 
in detail. The MG-implementation is also to be changed to use cell-based 
viscosity averages. 

• Free-slip boundary condition and propagator matrix benchmark tests: P. Bollada 
and R. Davies showed that adding boundary terms to the right hand side of the 
momentum equation reduces some sort of errors while other kinds of errors still 
exist. They will continue to figure out the exact cause of that behavior and work 
on fixing this. With direct access to the radial velocity component, the local 
spherical coordinate system offers a straightforward way to implement the free-slip 
boundary condition. J. Baumgardner implemented this in a local copy of 
Terra, and it is ready to be used. The group has agreed to create a repository 
branch to continue working on that. If successful, the local spherical coordinate 
system-version will be merged into the trunk after some months of testing. 

• Adding Ruby tests: The Ruby test framework is ready to be used extensively 
in testing Terra’s subroutines as individually as possible. It can also be used for 
debugging by application of subroutines to predefined scalar and vector fields. 
However, the coverage of Terra by Ruby tests is still very low and needs to be 
extended. 
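
The switch to a triangle-based operator with cell-averaged viscosities, and the export of operators in sparse matrix format, can be illustrated with a deliberately simplified sketch. The Python code below assembles a viscosity-weighted scalar Laplacian from linear triangles in a flat 2-D domain, stores it in coordinate (sparse) form and applies it to a nodal field. It is not Terra's operator, which acts on the velocity field in a spherical shell; all function names and the small test mesh are hypothetical.

```python
def assemble_viscous_laplacian(nodes, triangles, cell_viscosity):
    """Return {(i, j): value} for a viscosity-weighted P1 Laplacian.

    nodes          : list of (x, y) coordinates
    triangles      : list of node-index triples
    cell_viscosity : one (cell-averaged) viscosity per triangle
    """
    entries = {}
    for tri, eta in zip(triangles, cell_viscosity):
        (x0, y0), (x1, y1), (x2, y2) = (nodes[n] for n in tri)
        det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
        area = 0.5 * abs(det)
        # Gradients of the three linear basis functions (constant per cell).
        grads = [((y1 - y2) / det, (x2 - x1) / det),
                 ((y2 - y0) / det, (x0 - x2) / det),
                 ((y0 - y1) / det, (x1 - x0) / det)]
        # Viscosity-weighted sum over the triangle integral.
        for a in range(3):
            for b in range(3):
                value = eta * area * (grads[a][0] * grads[b][0]
                                      + grads[a][1] * grads[b][1])
                key = (tri[a], tri[b])
                entries[key] = entries.get(key, 0.0) + value
    return entries


def apply_operator(entries, u):
    """Multiply the sparse operator (coordinate format) with the vector u."""
    result = [0.0] * len(u)
    for (i, j), value in entries.items():
        result[i] += value * u[j]
    return result


# Unit square split into two triangles with very different viscosities.
nodes = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
triangles = [(0, 1, 2), (0, 2, 3)]
A = assemble_viscous_laplacian(nodes, triangles, [1.0, 10.0])
print(apply_operator(A, [1.0, 1.0, 1.0, 1.0]))
```

In such a formulation each triangle contributes with its own viscosity, so a sharp viscosity jump between neighbouring cells is represented in the same way on every grid level; the dictionary of (row, column) entries is the kind of data that could be handed to an external sparse solver package.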


2.3 Andean Model 

Over the last 3 years, U. Walzer devoted a large part of his time to the analysis 
and synopsis of the geophysics, geology and geochemistry of the Andean orogenies. 
This has been done in close cooperation with J. Kley and L. Viereck-Gotte. Some 
geophysical and geological considerations are outlined in Sect. 3.2.1. 


2.4 Long-Term Related Works of Our Group 

Kley investigated the structural geology of some areas of the central Andes [40] and 
gave a regional structural analysis and kinematic restoration [41,47]. He participated 
in an effort to extend quantitative structural analysis to a transect right across the 
backarc area. First steps were also taken towards constraining the evolution of 
strain rates over time [49], leading to the suggestion that the continental strain 
rates had increased during the Andean orogeny. Kley began to extend kinematic 
analysis to the entire orogen, employing serial balanced sections and map view 
restoration techniques [44]. A map view kinematic model of the central Andes [45] 
formed the basis for comparison of geologically derived, orogen-scale kinematics 
with GPS and seismologic data [31, 32, 52]. It could be shown that the present- 
day strain field from satellite geodetic data closely matches the strain field for the 
last 10 Ma as inferred from geologic evidence. Additional evidence was presented 
that the strain rate in the South American plate becomes independent of plate 
convergence rate in the later stages. Using the map-view strain field as input for a 
numerical model, an attempt was also made to constrain crustal thickness evolution 
and the flux of crustal material during the Andean orogeny [33]. The studies on 
variations in structural style along the Andes [42] also triggered a second line of 
research dealing with the influence of inherited lithospheric heterogeneities on the 
spatial strain distribution. One important factor here is the widespread occurrence 
of Mesozoic rift basins [50] that were partially inverted as the thrust front migrated 
across them. Several case studies from a particularly well-exposed rift system 
in northern Argentina helped to clarify the importance of fault reactivation and 
stratigraphic discontinuities in conditioning the mechanical behavior of the upper 
crust in contraction [43,46,51,67]. The results of these studies were incorporated in 
[48,71]. 

Walzer and co-workers worked on convection-fractionation problems. The thermal 
evolution of the mantle and the chemical evolution of the principal geochemical 
reservoirs have been modeled simultaneously by a fractionation mechanism plus 
2D-FD thermal convection [93-96]. Oceanic plateaus, enriched in incompatible 
elements, develop leaving behind the depleted parts of the mantle. The resulting 
inhomogeneous heat-source distribution generates a first feed-back mechanism. The 
lateral movability of the growing continents causes a second feed-back mechanism 
[95]. Effects of the viscosity stratification on convection and thermal evolution of a 
3D spherical-shell model have been investigated and a viscosity profile of the mantle 
was developed [103, 104]. The paper [106] presents 2D and 3D thermochemical 
models of mantle evolution where a self-consistent theory is included using the 
Helmholtz free energy, the Ullmann-Pan’kov equation of state, the free-volume 
Grüneisen parameter and Gilvarry’s formulation of Lindemann’s law. In order to 
obtain the relative variations of the radial factor of the shear viscosity, the pressure, 
P, the bulk modulus, K, and dK/dP from the seismic model PREM have been used. 
The publications [97, 104,105, 107] present models of self-consistent generation of 
stable, but time-dependent plate tectonics on a 3D spherical shell. Different types of 
solutions have been found for different models by systematic variation of parameters 
[97, 102,104,108, 109]. Stirring effects are investigated in [22]. A 3D spherical-shell 
mantle convection and evolution model with growing continents [97, 100,101,108, 
109] has been developed. The evolution model equations guarantee conservation of 
mass, momentum, energy, angular momentum, and of four sums of the numbers of 
atoms of the pairs ²³⁸U-²⁰⁶Pb, ²³⁵U-²⁰⁷Pb, ²³²Th-²⁰⁸Pb, and ⁴⁰K-⁴⁰Ar. The pressure- 
and temperature-dependent viscosity is supplemented by a viscoplastic yield stress. 
The lithospheric viscosity is partly imposed to mimic the viscosity increase by 
chemical layering and devolatilization. Stochastic effects [98] are shown to exist 
especially in the chemical differentiation. Although the convective flow patterns 
and the chemical differentiation of the oceanic plateaus are coupled, the evolution 
of the time-dependent Rayleigh number, Ra_t, is relatively well predictable from run 
to run and the stochastic parts of the Ra_t(t) curves are small [97]. 

Viereck-Gotte and co-workers worked not only on the genesis of the Jurassic 
Ferrar large igneous province in Antarctica but also on the connection between 
Antarctica, South America and Africa. Plateau-forming lavas in the Karoo province 
of South Africa and in the Ferrar province in Antarctica were emplaced synchronously 
at about 180 Ma. They thus seem to originate from the same dynamic 
mantle process. However, both are distinguished by their isotopic characteristics 
with respect to the Rb/Sr- and Sm/Nd systems: while the Karoo magmas show 
mantle values, the Ferrar magmas exhibit enriched upper crustal values. However, 
the boundary between both Jurassic igneous provinces is marked by a large 
transpressional shear zone (Heimefront SZ) of Pan-African age (600-500 Ma), 
crossing Dronning Maud Land (S-African side of Antarctica) on the continent side 
of the Grunehogna craton, a fragment of the W-Gondwana Transvaal craton. This 
shear zone is interpreted as a reactivated suture of Grenvillean age (1.1 Ga). If 
Jurassic magmas on either side of this boundary are isotopically different, it must 
be concluded that 

1. This is not a signature in a lower mantle plume. 

2. This must be a signature within the subcontinental lithospheric mantle. 

3. This signature must be older than Grenvillean in age. 

4. It must be introduced into a paleo supra-subduction mantle wedge by subduction 
processes if it is a crustal isotopic signature. 

Their studies concentrated on the timing of the initiation of the Ferrar as a 
large igneous province as well as on the physicochemical characterization of the 
melt source region conditions during melt differentiation. Studying the intrusive, 
extrusive and volcaniclastic rocks: 

• They reconstructed the initiation of a large igneous province to have occurred in 
several steps of melt pulses within 5 Ma (189-183 Ma ago). 

• They showed initiation to have started with large-volume shallow-level intrusions 
of low-Ti andesitic melts into wet fluvial sediments (Triassic/Jurassic) associated 
with diatreme-forming Taalian-type eruptions. 

• They showed that these eruptions were followed by basaltic andesites in small-volume 
eruptions of partly pillowed lavas from local eruptive centers, prior to large-volume 
plateau-forming lava extrusions from feeder dikes, and then by a final pulse of 
large-volume andesites high in Ti that had differentiated from a common primary melt under 
lower pressure, oxygen fugacity and water activity. 

All melts belong to the tholeiitic differentiation series, however with orthopyroxene 
instead of olivine as early fractionating mafic solidus phase. REE, Sr-Nd-isotope 
and PGE characteristics indicate generation of the primary melts within the 
spinel-lherzolite zone of an isotopically enriched and sulfur-undersaturated subcontinental 
lithospheric mantle. Crustal silica enrichment of this source due to 
an overprint in a supra-subduction environment had already been concluded from 
Re-Os isotopy. Our Sr-isotope data in plagioclase phenocrysts, however, exhibit 
decreasing radiogenic character in subsequent melt pulses indicating additional 
assimilation of crust to decreasing extents during ascent and differentiation. Due 
to a Cretaceous thermal event, ⁴⁰Ar/³⁹Ar ages (ranging from 235 to 90 Ma) and 
S-isotope characteristics are heavily disturbed; only a few samples exhibit the relict 
primary δ³⁴S value of −19.5. 


3 The Andean Subduction Model and Its Embedding 

3.1 Some Fundamental Remarks 

Our modeling of the Andean slab should be compatible with our basic assumption 
that plate tectonics is an integral part of the convection of the entire mantle. 
Therefore, we should aim at a regional model which is embedded into a realistic 
convective spherical-shell model. It is possible that the mantle convection is 
characterized by top-down control. But it is also possible that the larger part of 
the mantle mass essentially determines the movements at the surface since the most 
important share of the primordial energy is stored there and also the principal share 
of the radiogenic energy is released there. Here we refer to the absolute value, not to 
the density of these quantities. On the other hand, for reasons of energy, it is evident 
that the Earth’s core cannot play a prominent part in controlling mantle convection. 
It is exactly the reverse. The mantle convection determines the boundary conditions 
of the hydromagnetic convection in the outer core. For example, in time spans 
of high activity of mantle convection, the latter one generates lateral temperature 
differences in D" which reduce the number of magnetic reversals or even make 
them impossible because of the anti-dynamo theorems. 

Furthermore, it is well known at the present time that the oceanic lithosphere 
consists of three (or more) layers which are chemically different and that its lower 
boundary is characterized by a sharp viscosity jump. At the same time the oceanic 
lithosphere is also a thermal boundary layer. So, we do not intend to use the old 
simplified approach that the existence of an oceanic lithosphere can exclusively be 
explained by a thermal boundary layer. In this case we should expect a gradual 
decrease of the viscosity at the lower boundary. Furthermore, we conclude that the 
principal part of the buoyancy of the slab heavily depends on chemistry and that 
phase transitions, especially the basalt-eclogite transition, play an eminent role. 

So, we want to combine a global 3-D spherical-shell convection model with a 
3-D regional convection model of South America and the surrounding plates and a 
model of the orogeny of the Cordilleras. The Andes are an ideal test case for various 
reasons: 

• They are very large, thus keeping the inevitable problems of different scales at a 
minimum. 

• They are active and well-studied. A wealth of information constrains the 
processes of orogenesis. 

• They have a simple large-scale geometry. The plate margin is gently curved and 
was even straighter in the past. Convergence is orthogonal to the mean trend of 
the margin. The age structure of the Nazca plate near the contact with South 
America is symmetric, with the oceanic lithosphere oldest in the center and 
younging to both sides. 

• Their plate kinematic framework has remained nearly unchanged for the last 
50 Ma. On the other hand, there was a prominent switch in the mode of 
subduction earlier, from presumably steep with backarc extension to low-angle 
with backarc shortening. In the Andes, both end-member settings can therefore 
be studied in the same place. 

• There is a clear compositional and rheological contrast between the two converg¬ 
ing plates. It is unlikely that large volumes of material are transferred across the 
plate contact, in strong contrast to continental collision zones. 

• The Andean substrate has a simple geologic history. Much of the South American 
plate was in place before subduction started some 200 Ma ago. Except for the 
northern Andes, no terranes were accreted after the Paleozoic. 

We aim at a numerical, nearly purely dynamic model with a minimum number 
of restrictions and additional assumptions. This model is to explain the physical 
mechanism of the essential features of Andean orogeny. 


3.2 A Model of the Andean Subduction 

Outlines of the Model 

We do not develop an exclusively regional model of the Andean orogenesis since, 
in this case, the temporally varying boundary conditions are unknown. Therefore, 
it is often assumed, for reasons of simplicity, that there are no effects from outside 
of the regional computational domain. However, some changes in the arc volcanism 
and in the tectonic shortening of the Andes suggest a connection with the 30-Ma 
Africa-Eurasia collision. Therefore, we embed a regional 3D model into a 3D 
spherical-shell model. We solve the balance equations of momentum, energy 
and mass in the spherical-shell model using somewhat larger time steps on a 
whole-mantle grid that is coarser than the regional one. The values of 
creeping velocity, temperature and pressure determined in that way at the 
boundaries of the regional computational domain then serve as boundary conditions 
for a computation with smaller time steps, in which the balance equations are solved 
in the regional computational domain. 
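
The coupling between the two time-stepping levels can be sketched as follows. The Python code below is only a schematic illustration and not part of Terra: it assumes, hypothetically, that boundary values from two consecutive global-model time steps are interpolated linearly in time to supply the regional model at its smaller time steps. The function names and the sample values are invented for the example.

```python
def interpolate_boundary(t, t0, values0, t1, values1):
    """Linearly interpolate boundary values between two global snapshots.

    values0 and values1 hold one quantity (e.g. temperature) at the same
    boundary nodes of the regional domain at the global times t0 < t1.
    """
    if t1 <= t0 or not (t0 <= t <= t1):
        raise ValueError("t must lie within the global time step [t0, t1]")
    w = (t - t0) / (t1 - t0)
    return [(1.0 - w) * a + w * b for a, b in zip(values0, values1)]


def regional_times(t0, t1, n_sub):
    """Times of the n_sub smaller regional steps within one global step."""
    dt = (t1 - t0) / n_sub
    return [t0 + k * dt for k in range(1, n_sub + 1)]


# One global step from t = 0 to t = 1, subdivided into four regional steps.
for t in regional_times(0.0, 1.0, 4):
    print(t, interpolate_boundary(t, 0.0, [0.0, 2.0], 1.0, [1.0, 4.0]))
```

Other choices, such as holding the boundary values fixed during a global step or iterating between the two models, would fit the description in the text equally well; linear interpolation is simply the smallest example.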

3.2.1 The Regional Model 

Meanwhile, we have developed a numerical tool for several classes of regional 
models, which required considerable effort. We are therefore in no way restricted to the 
model developed in the following and could equally compare the effects of different 
geological proposals. To design the regional model, we not only developed extensive 
numerical improvements (see Sect. 2.2) in the Terra code but also studied the latest 
geophysical, geological and geochemical results, taking them into consideration in 
the draft of a physically reliable mechanism which is not only geodynamically 
probable but also numerically feasible (U. Walzer, J. Kley, L. Viereck-Gotte). 
There are several geologically descriptive model proposals. We mention only 
[13, 28, 29, 37, 71, 76]. Our computable model system ought to answer the 
following questions. 

• Why do we observe today in some segments of the Andes flat subduction and 
magmatic lull {Bucaramanga, Peruvian (2-15°S) and Pampean (27-33°S) flat 
slab [75]}, but in other segments normal subduction with dip angles between 30° 
and 40°? This is surprising since the westward velocity of the South American 
plate and the eastward velocity of the Nazca plate do not essentially vary 
alongshore. 

• Why do the flat-slab segments migrate in Cenozoic times along-side the Andes 
[76]? Ramos and Folguera [77] found an almost continuous belt of flat slabs 
which migrate southward. 

• Why is the present-time volcanism restricted to segments with a 30-40° dipping 
slab [1,36,48,83]? 

• Why does the deformation essentially start in the West and migrate to and finish 
in the East [71]? We made estimates for several other hypotheses. On that basis, we 
consider the most promising answer to the above questions to be 
the assumption that oceanic plateaus and aseismic ridges are carried 
by the conveyor belt, i.e. by the Nazca plate. The plateaus and ridges generate 
additional positive buoyancy [27]. So the hinge of the slab migrates to the East 
until the volcanism totally vanishes. However, if we scrutinize a good geological 
map of South America we notice that the relation between flat-slab segments and 
ridges is not simple. We can assign the Pampean flat slab to the Juan Fernandez 
Ridge [2]. Even the southward migration of the Pampean flat-slab zone can be 
explained by the form and movement of the Juan Fernandez Ridge [37]. The southern 
part of the Peruvian flat slab can be attributed to the Nazca Ridge. For the northern 
part of the Peruvian flat slab we have to introduce the hypothesis of an immersed 
Inca Plateau [28]. However, east of the Carnegie Ridge there is abundant 
volcanism in Ecuador. Michaud et al. [64] show a detailed reconstruction of the 
eastward movement of the Carnegie Ridge which extends at least 60 km below 
the South American plate with a continuous plunging slab down to a depth of 
200 km. The adakitic signal is proposed to be ridge-induced. In the case of the 
Iquique Ridge we have to assume that there is no eastern continuation of it. Pindell 
and Kennan [74] show that the flat-slab area in northern Colombia and Venezuela 
might be considerably larger than the area assumed by Ramos [76]. 

• Why should our mechanism of the generation of the Andes be three-dimensional? 
Several observations suggest the idea that essential features of the Andean 
orogenesis cannot be explained by 2-D dynamical models: 

(a) Hindle et al. [33] conclude that a mass balance which is restricted to a cross- 
section through the Andes leads into contradictions. They show that there 
has to exist an essential mass transport alongside the Andes. In particular, 
the displacement of material toward the axis of the bend in the central Andes 
leads to a significant crustal thickening. This cannot be explained by a two- 
dimensional model, neither kinematically nor dynamically. 

(b) Anderson et al. [3], Kneller and van Keken [53] and Barnes and Ehlers 
[5] show and discuss, for the southern Andean subduction zones, trench- 
parallel high seismic shear velocities near to the trench and an abrupt 
transition to trench-perpendicular high seismic shear velocities in the back 
arc. The Brazilian subcrustal lithosphere beneath the eastern Cordillera, the 
Interandean and the Subandes are East-West fast whereas the shear-wave 
velocity under the Altiplano and Puna has maximum values in the North- 
South direction. This is a hint that a significant three-dimensional flow might 
be involved in the mechanism. 

(c) The toroidal component. A pertinent argument for the necessity of 3-D models 
results from the following considerations and calculations. Gable 
et al. [20] and O’Connell et al. [70] already showed the relevance of the toroidal-poloidal 
partitioning for lithospheric plate motions. The lateral movement of the subducting 
slab induces slab-parallel flows and a rollback-generated flow 
around the slab [81]. Stegman et al. [86] demonstrated that, for some typical cases with 
non-vanishing rollback of the subducting slab, 69 % 
of the energy of the negative buoyancy of the slab is converted into the 
toroidal component of the rollback-induced flow whereas 18 % is consumed 
for the weakening of the plate. These numbers show the importance of 
3-D modeling in a striking way. Only in the very beginning of the Andean-specific 
modeling do we prescribe the velocity of the migrating subduction 
hinges as a function of time according to [71] in order not to overburden 
the model. But the other degrees of freedom of the flexible slab should be 
determined by the differential equations of the model. We vary the lateral 
extent of the individual slab between 200 and 5,000 km. That is, in a first 
type of numerical experiments we introduce the individual lobes shown by 
the distribution of seismicity. In a second numerical experiment we assume 
an undivided slab for the whole South American continent (Fig. 7), at best 
with a disconnection at the Chile Rise. For all versions we cut out a spherical 
diamond (Fig. 8) from the newly improved and inf-sup stable Terra code 
which is able to solve the convection differential equations in a spherical- 
shell mantle. Essential parts of the South American plate and the Nazca 
plate fit into this spherical diamond. The vertical boundary conditions are 
iteratively taken from a whole-mantle convection model. Cf. Sect. 3.2.2. In 
contrast with [86], we do not neglect the energy equation. It is evident 
that the subduction mechanism is possible only assuming a low-viscosity 
asthenosphere [18] which is less dense than the average oceanic lithosphere. 
Furthermore, the rheology may not be purely viscous. For this purpose, 
we [97] prefer a viscoplastic yield stress. The lithospheric-asthenospheric 
boundary for the viscosity is relatively sharp, also for the oceanic lithosphere. 
This is a special challenge for our code. 

Fig. 7 The geometric starting configuration in our second Andean regional model (Taken 
from [92]) 

Fig. 8 The diamond (red), embedded in the global grid, which will be dyadically refined to the 
model resolution (Modified from [6, Fig. 1]. Of course, we use a further refinement of the grid) 



















U. Walzer et al. 


[Fig. 9 (plot not reproduced): horizontal axis Age (Ma), running from 200 to 0; annotations include magmatic gaps, the Peruvian shortening and SVZ reference values] 

Fig. 9 The La/Yb ratio as a function of time for the igneous rocks of the north Chilean arc (21- 
26°S) acc. to Haschke et al. [29]. Note that after the flat-slab stage, the La/Yb suddenly decreases 
to a low starting value. The periods of orogenetic activity are before these drops 

• By the model, it should be possible to understand essential geochemical observations 
relevant for South America. The difficult problem of delamination is 
very probably connected with this issue. It is necessary to clarify what kind 
of delamination is dominant. Figure 9 shows the gradual increase of the flow 
of incompatible elements. After each flat-slab period, this rise is abruptly 
interrupted. Plots of this kind exist not only for the mass ratio La/Yb but 
also for $^{87}$Sr/$^{86}$Sr and $^{144}$Nd/$^{143}$Nd. These curves describe a slow growth of the 
abundances of elements with large ionic radii within one cycle since $^{87}$Sr is 
the daughter isotope of $^{87}$Rb, $^{143}$Nd is the daughter isotope of $^{147}$Sm and the 
other mentioned isotopes are stable. The fluctuations of the whole-rock initial 
$\varepsilon_{\mathrm{Nd}}$ of the central Andean arc by DeCelles et al. [13] are evidently related to 
Fig. 9. DeCelles et al. [13] emphasize that the cyclical changes of the isotopic 
compositions of arc magmas cannot be explained by changes in the convergence 
rates. They and also we expect that these changes have to be explained by 
episodic gravitational foundering. There are two principal possibilities. 

(A) According to [13], below arc and hinterland, i.e. below the western parts 
of the South American continent, the eclogitization of the thickening lower 
continental crust and of the lithospheric mantle causes a density increase 
and therefore a delamination so that these units sink into the mantle wedge 
driven by their own weight. Carlson et al. [10] discuss the possibility that 
the continental mantle above the wedge of the mantle overlying a subducting 
oceanic plate can become unstable. The detachment from the overlying 
continental crust can cause major orogenetic episodes. Davidson and Arculus 
[11] propose a delamination of the cumulate layers below the seismological 
Moho back into the mantle of the sub-arc wedge. The two-dimensional 
numerical model by Sobolev et al. [84, 85] is compatible with the geological 
models mentioned under a) and belongs to b)-type of papers mentioned in 
the Introduction. They use a viscoelastic rheology supplemented by Mohr-Coulomb 
plasticity for the layered lithosphere. The drift of the overriding 
plate and the pulling of the slab are, however, prescribed by the velocities at the 
boundaries of the 2-D model area and are not calculated by solving 
the balance equations. What drives Andean orogeny? Sobolev et al. 
[84, 85] answer this question by numerical experiments using their 2-D model 
and varying only one influence parameter each. They conclude that the major 
factor is the westward drift of the South American plate. Paragraph (A) 
outlines only one way of thinking which we intend to test. 

Fig. 10 A calculated marble-cake mantle model acc. to Walzer and Hendel [99]. This is an 
equatorial section showing the present-time state of the chemical evolution of incompatible 
elements of the Earth’s mantle. We use a modernized reservoir theory. The depleted MORB mantle 
(DMM) and a mantle which is rich in incompatible elements are, however, strongly intermixed. Strongly 
depleted parts of the mantle which include more than 50 % DMM are represented by yellow areas. 
Relatively rich mantle parts with less than 50 % DMM are orange-colored. In general, the yellow-orange 
boundary does not correspond to a discontinuity of the abundances of U, Th, K, etc. The 
cross sections through the continents are red. Black dots represent the oceanic plateaus. The yield 
stress is 125 MPa, the viscosity-level parameter is -0.50 

(B) Here we propose a second hypothesis to understand the mentioned geochemical 
observations of [29] and of [13, Fig. 5]. This hypothesis is guided 
by the idea of a geochemical marble-cake mantle [97, 99] (see Fig. 10) but 
composed of irregularly formed parts of a depleted MORB mantle (DMM) 
with 80-180 ppm H$_2$O and 50 ppm C and another, richer reservoir with 
550-1,900 ppm H$_2$O and 900-3,700 ppm C [23, 111, 112]. The mid-oceanic 
ridge basalt is denoted by MORB. It is possible that the reservoirs are 
intermixed in smaller quantities so that there is no sharp boundary between the 
different parts of the mantle [34, 89]. DMM dominates in the smaller depths 
of the mantle. The deeper the slab dives into the mantle, the higher is the 
probability to touch regions with a high abundance of $^3$He and incompatible 
elements. Pilz [73] investigated $^3$He/$^4$He ratios in the Puna plateau and at 
the volcano Tuzgle. He found that these $^3$He/$^4$He values are higher than in 
the western Cordillera and in the Salta Basin east of it. Pilz and also we 
conclude that a provenance by degassing of the slab is not feasible since the 
mantle’s $^3$He is primordial. The $^3$He of the atmosphere cannot be subducted 
in appreciable quantities. 

There are different, but related models of chemical layering inside the 
subducting oceanic lithosphere [21, 80]. There is a water-rich subduction 
channel above the slab. Because of the sediment dehydration and the oceanic 
crustal dehydration it is not entirely clear how deep this hydrous channel is 
extended. Not only the $^3$He/$^4$He ratio of the Puna plateau [73] but also La/Yb, 
$^{87}$Sr/$^{86}$Sr and $^{144}$Nd/$^{143}$Nd increase from an age of $\tau = 4.5$ Ma until the present 
time [29]. There are different explanations for this phenomenon. One of them 
would be a rise in an antiparallel direction in hot fingers immediately or in 
some distance above the subducting slab. Such a suggestion has been offered 
by Tamura et al. [91] for the NE Japanese arc. Marsh [63] proposed a similar 
idea of a hydrothermal flow field, in this case immediately above the upper 
surface of the down-going slab. In this way he explains also the sharp line 
of volcanoes or the volcanic front which can be observed in the Andes. Pilz 
[73] concludes from seismic observations that such hot fingers are also in the 
wedge of the southern central Andes and that these fingers are near the surface 
of the slab. So we want to develop a numerical model with fluid pathways of 
hydrothermal fluids, described by a particle approach, not very far from the 
slab but with an antiparallel flow direction. This idea is corroborated also by 
Furukawa [19]. 

We are working on the numerical problem of cutting a spherical diamond 
out of the dynamical 3-D spherical-shell model, and of determining the temporally 
changing boundary conditions at the vertical side walls of the 3-D diamond by the 
solution of the convection in a spherical shell, but now with prescribed plates at 
the surface of the sphere. The Nazca plate moves with a (in the first approach) 
prescribed angular velocity through the margin into the computational domain and 
dives under the South American continent because of a Rayleigh-Taylor instability 
which is induced mainly by the transition to eclogite. The oceanic plate has a given 
sandwich structure of zero-pressure densities, $\rho_0$ [21], and viscosities, $\eta_0$. The slab 
is supposed to be freely movable at the surface and floating inside the mantle. It is 
well-known that it is difficult to detach a spherical-shell plate from the surface of 
the shell into the mantle in a slab-like manner [8]. As in [97], we want to introduce 
a viscoplastic yield stress at the near-surface lithospheric region and compare the 
creeping viscosity, $\eta_c$, at each position and time with the plastic viscosity, $\eta_p$, and 
use the minimum value in the model. This procedure is sensible from the physical 
point of view. If a piece of the Nazca plate which is thickened by an oceanic plateau 
or a passive ridge approaches South America then it pushes the hinge back under 
the South American plate because of its positive buoyancy. Therefore the volcanic 
front migrates to the East. The residual slab will become steeper and its tip will 
touch deeper and deeper parts of the mantle. Therefore, more and more atoms of 
incompatible elements will rise using the fluid pathways explaining the rising parts 
of the curve of Fig. 9. We are able to describe the movements of the hydrothermal 
fluid by a tracer module. These particles, describing the migration velocities of 
higher abundances of incompatible elements, $^3$He, metals like Cu, Ag, etc., are not 
carried along with the creeping rock. They can move along a surface antiparallel 
to the slab which represents a thin layer. A serpentinized layer-like part of the 
mantle wedge just above the subducting oceanic crust could serve in reality as such 
a thin layer [25]. The slab movement is (in this model as in reality) an integral 
part of the solid-state mantle convection. The mentioned quantities $\rho_0$ and $\eta_0$ of 
the slab will be additionally described by other tracers which, in this case, will 
be entrained and carried along by the solid-state creep down to the interior of 
the mantle. The generation and eastward migration of major Andean ore deposits 
[29,38,39,79], the eastward migration of the volcanic front, the eastward movement 
of the deformation ages across the southern central Andes (21°S) [17,71] as well 
as the eastward migration of deformation in the Interandean and Subandean [16,41] 
can be dynamically modeled by our approach. When the subducting movement of 
the down-hanging part of the slab (and the flat-slab part behind) is accelerated, 
then an elevated activity of the Atacama fault zone, the Peruvian shortening, the 
Incaic shortening etc. are induced. The sudden interruption of the supply of highly 
incompatible elements (Fig. 9) must be induced by a radical event. Using the 
analogous, geologically describing models of [59, Fig. 4] and [35, Fig. 7], it cannot 
be concluded that many unusual features of the Permian-Jurassic South China fold 
belt can be explained by shallow subduction with an extensive final foundering of 
the lower plate combined with a roll-back of a small remnant subducting slab in 
the Mid-Jurassic. Humphreys [35] concludes for the Mid-Tertiary that an extensive 
sinking of the flat Farallon slab occurred which caused an uplifting of the continent 
covering a large area of the western North America. We propose an explanation by 
an extensive generalized eclogitization of the flat-slab part of the oceanic plate and 
an extensive delamination due to a Rayleigh-Taylor instability. The latter process 
can be simulated for the former and present South American flat slabs using a 
particle approach. The extensive tear-off would entirely interrupt the supplies of 
incompatible elements, $^3$He, etc. since the tracer transport path near the upper 
surface of the low-hanging part of the slab and the flat part of the slab is ripped. 
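
The viscosity rule described above (use the smaller of the creeping viscosity $\eta_c$ and the plastic viscosity $\eta_p$) can be illustrated by a minimal sketch. The definition $\eta_p = \sigma_y / (2\dot\varepsilon)$ is an assumption made here for illustration, as are the function names and all numerical values except the yield stress of 125 MPa quoted in the caption of Fig. 10. The sketch is not taken from the Terra code.

```python
def plastic_viscosity(yield_stress_pa, strain_rate_per_s):
    """Plastic viscosity; assumed here as eta_p = sigma_y / (2 * strain rate)."""
    return yield_stress_pa / (2.0 * strain_rate_per_s)


def effective_viscosity(eta_creep_pa_s, yield_stress_pa, strain_rate_per_s):
    """Minimum of creeping and plastic viscosity, as described in the text."""
    eta_p = plastic_viscosity(yield_stress_pa, strain_rate_per_s)
    return min(eta_creep_pa_s, eta_p)


SIGMA_Y = 125.0e6   # Pa, yield stress quoted in the caption of Fig. 10
ETA_CREEP = 1.0e24  # Pa s, hypothetical creeping viscosity of cold lithosphere

# Slow deformation: the plastic viscosity is larger, so creep controls.
slow = effective_viscosity(ETA_CREEP, SIGMA_Y, 1.0e-17)
# Fast deformation: yielding limits the viscosity.
fast = effective_viscosity(ETA_CREEP, SIGMA_Y, 1.0e-14)
```

In this sketch the lithosphere stays stiff where it deforms slowly and weakens only where the strain rate is high, which is the behaviour needed to let a plate bend and detach at a subduction zone.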

At first glance, (A) and (B) seem to be competing models, but in reality they 
do not entirely exclude one another. It is evident that the two roughly sketched 
numerical regional models are very ambitious. Therefore, we could report here only 
on the numerical and mathematical results of the preparatory efforts, the further 
development of the code and our geodynamical conception. However, a spherical- 
shell model is finished (cf. 2.1). 
3.2.2 Spherical-Shell Model: Prescribed Angular Velocities 
of the Lithospheric Plates 

To define the time-dependent boundary conditions of the vertical side walls of the 
spherical diamond containing South America and the Nazca plate, we introduce 
a spherical-shell convection model with the same radial profiles of the relevant 
physical parameters as in [100,101] but with prescribed angular velocities of the 
plates for the last 200 Ma. This desirable time span is essentially determined by 
the available observational data, e.g. by Fig. 9. Most plate-motion models, however, 
do not go back that far. The main difficulty, though, lies elsewhere. Even 
though we know all velocities of neighboring plates between each other, we do not 
know the “absolute” velocities of the plates relative to the highly viscous mid part 
of the lower mantle. But the net rotation of the averaged lithosphere is important 
for the kinematic analysis and the dynamic modeling of the slabs, especially the 
slabs at the margin of South America. The “global tectonic map” of [82] is based 
on the Indo-Atlantic hotspot reference frame by O’Neill et al. [72] and on the 
relative plate motion model by DeMets et al. [14]. This map shows a large eastward 
motion of the Nazca plate, large in comparison to the magnitude of the velocity 
of the South American plate. In relation to this feature, this map is similar to the 
deforming, no-net-rotation reference frame model GSRM by Kreemer et al. [56]. 
In contrast to these two models, the hotspot reference model HS-3 [24] shows 
a high westward velocity of the South American plate, high in comparison with 
the amount of the velocity vectors of the Nazca plate. HS-3 is based on the age 
progressions of ten Pacific ocean islands. Becker [7] and Becker and Faccenna [8] 
wrote a very good analysis of the problem of “absolute” plate velocities. Becker [7] 
and Long and Becker [62] try to determine the present-day plate velocities from the 
convective shearing movements and the seismic anisotropy of the upper mantle. But 
this procedure is more sensitive to the direction of the velocity vector than to its 
magnitude. HS-3 contains a large net rotation of the laterally averaged lithosphere 
relative to the high-viscosity parts of the lower mantle. The majority of the authors 
derived a smaller real net rotation which is defined as the spherical harmonic degree 
$l = 1$ component of the toroidal part of the plate velocities. Ricard et al. [78] 
estimated 30 % of HS-3, Steinberger et al. [88] calculated 38 %. The azimuthal 
seismic anisotropy is compatible only with values less than 50 % of HS-3 [7]. 

Early on, Steinberger and O’Connell [87] linked the hot spot tracks and 
the movement of oceanic lithospheric plates. Gurnis et al. [26] report on a very 
practicable open-source system which contains the angular velocities for the plates 
from 140 Ma to the present time. Each plate has a time-dependent Euler pole. The 
plates are described by time-dependent closed plate polygons. Each of these plate 
boundaries has its own, time-dependent Euler pole. The code allows the introduction of 
new interactive plate boundaries [9]. We intend to use two or three seemingly 
realistic spherical-shell plate-motion models to define the boundary conditions 
of our regional convective system of South America and its surrounding. The 
repercussions of the different spherical-shell systems on the regional model should 
be investigated and compared. 
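
To illustrate how a prescribed angular velocity translates into a surface boundary condition, the following sketch evaluates the rigid-plate relation $\mathbf v = \boldsymbol\omega \times \mathbf r$ for a point on a spherical Earth. The function names, the spherical Earth radius and the example pole are assumptions chosen for illustration. They are not taken from GPlates or Terra.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def unit_vector(lat_deg, lon_deg):
    """Cartesian unit vector for geographic latitude and longitude."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plate_velocity_mm_per_yr(pole_lat, pole_lon, rate_deg_per_myr, lat, lon):
    """Cartesian velocity v = omega x r of a surface point, in mm/yr."""
    omega = math.radians(rate_deg_per_myr) / 1.0e6  # rad/yr
    radius_mm = EARTH_RADIUS_KM * 1.0e6             # km -> mm
    direction = cross(unit_vector(pole_lat, pole_lon), unit_vector(lat, lon))
    return tuple(omega * radius_mm * c for c in direction)


# Hypothetical example: pole at the North Pole, 1 deg/Myr, point on the equator.
v = plate_velocity_mm_per_yr(90.0, 0.0, 1.0, 0.0, 0.0)
speed = math.sqrt(sum(c * c for c in v))
```

For a time-dependent Euler pole, the same function would simply be evaluated with the pole position and rotation rate valid for the respective time interval.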
3.2.3 Future Numerical and Technical Improvements 

In addition to the joint work within the group of Terra developers, further developments 
are to be continued regarding the following items. 

• We are going to further improve the parallelization of the particle tracking 
routines in Terra. Compared to previous Terra versions, there is an extra need 
for communication among several MPI-processes to figure out connected regions 
of partial melting in the mantle from which incompatible elements are extracted 
and transported to the surface. A similar communication is required to define 
the extent of continental lithospheric plates. With the high number of tracers, 
it is crucial to compress the required global information locally before it is 
exchanged among neighboring processors. R. Hendel will continue to reduce 
the communication overhead for tracking globally connected regions, so that the 
scalability of the particle routines will be extended to 500 and more processors. 
Such an optimized way of communicating global regions and features is also 
needed in modeling the elasticity of the subducting plates in the regional Andean 
model. 

• We also plan to get the elasticity model in Terra working. It has to be chosen 
carefully how elasticity is dealt with in the solution of the mass and momentum 
equations. It could be necessary to iterate over the whole Stokes problem within every 
time step until an equilibrium between elastic and viscous forces is reached. 

• To run both regional and global models with time-dependent boundary conditions 
at the surface, plate reconstruction data will be imported from the GPlates 
code [9] (www.gplates.org). The development of the interface between GPlates 
and Terra will be done together with L. Quevedo, Sydney. 

• Furthermore, the documentation of the code, which is built automatically from source code 
comments with doxygen, will be enhanced to make it easier for 
new developers to work on Terra. 
Euler Deconvolution of GOCE Gravity 
Gradiometry Data 


M. Roth, N. Sneeuw, and W. Keller 


Abstract Euler deconvolution is a standard tool of geophysical prospection. In 
the early 1980s, the beginning of its development, it was used for the evaluation 
of magnetic field data. However, since the 1990s, together with the increasing 
power of computers, research was intensified on the aspect that Euler deconvolution 
is also applicable to gravity gradiometry data. Now we are in the position that 
gravity gradiometry data with near global coverage from a single source are 
available, namely the satellite mission GOCE (Gravity field and steady-state Ocean 
Circulation Explorer), launched on 17 March 2009. 

In this project we investigate the benefit of Euler deconvolution to geodesy, e.g. 
to retrieve global gravity models. We also assess whether geodetic methodology 
can contribute to enhance Euler deconvolution. So far, our project is still 
in the preparatory stage, mainly because the GOCE gradiometry data need to be 
preprocessed extensively. 


1 Introduction 

Euler deconvolution is a semi-automated method to estimate possible locations of 
magnetic field or gravity field sources. Since the paper of Thompson [10], who 
made Euler deconvolution applicable to magnetic data by the use of computers, this 
method has been of great interest in research. 

Marson and Klingele [5] applied Euler deconvolution for the first time to gravity 
vertical gradients, Zhang et al. [12] applied it to full gravity gradient tensor data. 
Usually, such tensor data are measured on shipborne or airborne platforms for 
relatively small areas (several hundreds of square kilometers). 
The satellite mission GOCE provides a global geoid model with unprecedented 
precision [7]. The nominally planned operational phase of 12 months was finished 
on 2 March 2011 after 2 years in orbit. (The operational phase was interrupted by 
calibration phases and failures of the on-board computers which could be fixed.) 
However, the satellite’s overall good health as well as its excellent data quality led 
to the extension of the mission at least until the end of 2012 [3]. 

The data provided by GOCE are mainly the full gravity gradient tensor, as well as 
the satellite’s coordinates for every observation epoch. However, satellite positions 
and data are given in different reference and time frames, which is one reason why 
a preprocessing of the data becomes necessary. 

Up to the present, GOCE has delivered a vast amount of data, the sampling frequency 
being 1 Hz. The amount of data (around 70 GB at the moment) and the computational 
method of Euler deconvolution as presented in the following section make 
high performance computing (HPC) a necessity. 


2 Euler Deconvolution 


2.1 Euler’s Homogeneous Function Theorem 


The general form of Euler’s homogeneous function theorem reads 

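\[ \mathbf{x} \cdot \nabla f(\mathbf{x}) = n\, f(\mathbf{x}). \tag{1} \]

Here $\mathbf{x}$ stands for the vector $(x_1, \ldots, x_k)^T$ and $\nabla$ is the gradient operator 
$\left(\frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_i}, \ldots, \frac{\partial}{\partial x_k}\right)^T$. The dot product of both yields 

\[ \mathbf{x} \cdot \nabla = x_1 \frac{\partial}{\partial x_1} + \ldots + x_i \frac{\partial}{\partial x_i} + \ldots + x_k \frac{\partial}{\partial x_k}. \]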

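If we now take the first-order derivative with respect to one of the variables, e.g. $x_i$, we get 
for the left hand side 

\[ \frac{\partial}{\partial x_i}\bigl(\mathbf{x} \cdot \nabla f(\mathbf{x})\bigr)
 = \frac{\partial \mathbf{x}}{\partial x_i} \cdot \nabla f(\mathbf{x}) + \mathbf{x} \cdot \nabla \frac{\partial}{\partial x_i} f(\mathbf{x})
 = \frac{\partial}{\partial x_i} f(\mathbf{x}) + \mathbf{x} \cdot \nabla \frac{\partial}{\partial x_i} f(\mathbf{x}). \]

The right hand side reads then 

\[ \frac{\partial}{\partial x_i}\bigl(n f(\mathbf{x})\bigr) = n\, \frac{\partial}{\partial x_i} f(\mathbf{x}). \]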
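Putting both sides back together yields 

\[ \frac{\partial}{\partial x_i} f(\mathbf{x}) + \mathbf{x} \cdot \nabla \frac{\partial}{\partial x_i} f(\mathbf{x}) = n\, \frac{\partial}{\partial x_i} f(\mathbf{x}), \]

which can be simplified to 

\[ \mathbf{x} \cdot \nabla \frac{\partial}{\partial x_i} f(\mathbf{x}) = (n - 1)\, \frac{\partial}{\partial x_i} f(\mathbf{x}). \tag{2} \]

This expression now has the same form as (1). It shows that the first-order partial 
derivatives $\frac{\partial}{\partial x_i} f(\mathbf{x})$ are homogeneous of degree $n - 1$. 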
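
The statement of (2) can be checked numerically. The sketch below uses the scaling definition of homogeneity, $f(t\mathbf{x}) = t^n f(\mathbf{x})$, for the function $f = 1/r$ (degree $n = -1$) and its analytic derivative with respect to $x$. The function names and the test point are chosen for illustration only.

```python
import math


def f(x, y, z):
    """Point-mass-like function 1/r, homogeneous of degree n = -1."""
    return 1.0 / math.sqrt(x * x + y * y + z * z)


def df_dx(x, y, z):
    """Analytic partial derivative of f with respect to x."""
    return -x / (x * x + y * y + z * z) ** 1.5


def homogeneity_degree(func, point, t=2.0):
    """Estimate n from the scaling relation func(t * x) = t**n * func(x)."""
    scaled = tuple(t * c for c in point)
    return math.log(func(*scaled) / func(*point)) / math.log(t)


p = (1.0, 2.0, 3.0)
n_f = homogeneity_degree(f, p)         # degree of f
n_dfdx = homogeneity_degree(df_dx, p)  # degree of the first derivative
```

The two estimated degrees differ by one, in agreement with (2).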


2.2 Gravity Gradients 


Because of the principle of superposition, we can consider the Earth as being assembled 
from numerous point masses. One of those point masses, located at the point 
$\mathbf{r}_0 = (x_0, y_0, z_0)$, produces at the measurement location $\mathbf{r} = (x, y, z)$ the potential 

\[ \delta V = -\frac{G\, \delta M}{r}, \tag{3} \]

where $r = \sqrt{(x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2}$ is the distance between $\mathbf{r}_0$ and $\mathbf{r}$, 
$G = 6.67384 \cdot 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$ [2] is the Newtonian constant of gravitation and $\delta M$ is the 
mass of the point mass. We obtain the acceleration $\delta\mathbf{g}$ by taking the gradient of (3). 


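\[ \delta\mathbf{g} = \nabla \delta V
 = \begin{pmatrix} \delta V_x \\ \delta V_y \\ \delta V_z \end{pmatrix}
 = \begin{pmatrix} \partial \delta V / \partial x \\ \partial \delta V / \partial y \\ \partial \delta V / \partial z \end{pmatrix}
 = \frac{G\, \delta M}{r^3} \begin{pmatrix} x - x_0 \\ y - y_0 \\ z - z_0 \end{pmatrix}. \tag{4} \]

By a further gradient operation we obtain the gravity gradient tensor 

\[ \nabla \delta\mathbf{g} = \nabla \otimes \nabla \delta V
 = \begin{pmatrix} \delta V_{xx} & \delta V_{xy} & \delta V_{xz} \\ \delta V_{yx} & \delta V_{yy} & \delta V_{yz} \\ \delta V_{zx} & \delta V_{zy} & \delta V_{zz} \end{pmatrix}
 = \begin{pmatrix}
 \partial^2 \delta V / \partial x^2 & \partial^2 \delta V / \partial x \partial y & \partial^2 \delta V / \partial x \partial z \\
 \partial^2 \delta V / \partial y \partial x & \partial^2 \delta V / \partial y^2 & \partial^2 \delta V / \partial y \partial z \\
 \partial^2 \delta V / \partial z \partial x & \partial^2 \delta V / \partial z \partial y & \partial^2 \delta V / \partial z^2
 \end{pmatrix}. \tag{5} \]
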





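Four gradients are linearly dependent because of the tensor’s symmetry, i.e. 

\[ \delta V_{xy} = \delta V_{yx}, \qquad \delta V_{xz} = \delta V_{zx}, \qquad \delta V_{yz} = \delta V_{zy}, \tag{6} \]

and the validity of Laplace’s equation 

\[ \delta V_{xx} + \delta V_{yy} + \delta V_{zz} = 0. \tag{7} \]

Therefore it is sufficient to specify the following five independent tensor elements: 

\[ \begin{aligned}
\delta V_{xx} &= \frac{G\, \delta M}{r^5} \left( -3 (x - x_0)^2 + r^2 \right), \\
\delta V_{xy} &= \frac{G\, \delta M}{r^5} \left( -3 (x - x_0)(y - y_0) \right), \\
\delta V_{xz} &= \frac{G\, \delta M}{r^5} \left( -3 (x - x_0)(z - z_0) \right), \\
\delta V_{yy} &= \frac{G\, \delta M}{r^5} \left( -3 (y - y_0)^2 + r^2 \right), \\
\delta V_{yz} &= \frac{G\, \delta M}{r^5} \left( -3 (y - y_0)(z - z_0) \right).
\end{aligned} \tag{8} \]
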

As an example, the gravity gradient tensor components of a single point mass are 
displayed in Fig. 1. 
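
As a numerical cross-check of (5) to (8), the following sketch evaluates the gravity gradient tensor of a single point mass, using the sign convention of (3). The function name, the source position, the measurement position and the mass are hypothetical values chosen for illustration.

```python
import math

G = 6.67384e-11  # m^3 kg^-1 s^-2, value quoted in the text


def point_mass_gradient_tensor(r, r0, delta_m):
    """3x3 tensor of second derivatives of dV = -G*dM/dist for a point mass at r0."""
    d = [r[i] - r0[i] for i in range(3)]
    dist = math.sqrt(sum(c * c for c in d))
    factor = G * delta_m / dist ** 5
    return [[factor * (-3.0 * (d[i] * d[j]) + (dist ** 2 if i == j else 0.0))
             for j in range(3)]
            for i in range(3)]


# Hypothetical setup: measurement at 250 km height, source 50 km below the origin plane.
T = point_mass_gradient_tensor((0.0, 0.0, 250.0e3),
                               (10.0e3, -20.0e3, -50.0e3),
                               1.0e15)
trace = T[0][0] + T[1][1] + T[2][2]
```

The tensor is symmetric, as stated in (6), and its trace vanishes up to rounding, as required by Laplace's equation (7).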


2.3 Standard Euler Deconvolution 

This subsection portrays the standard approach for Euler deconvolution as presented 
e.g. by Thompson [10], Reid et al. [8], Wu [11] or Mushayandebvu et al. [6]. Euler’s 
theorem (1) for the 3D case is given by 

\[ (x - x_0) \frac{\partial f}{\partial x} + (y - y_0) \frac{\partial f}{\partial y} + (z - z_0) \frac{\partial f}{\partial z} = n f, \tag{9} \]

where, again, $(x_0, y_0, z_0)$ represents the source location, $(x, y, z)$ the measurement location, 
$f$ the field and $n$ the degree. 

Additionally, for the computation of the Euler deconvolution, we assume that the 
field $f$ is the sum of a constant base field $B$ and the difference $\Delta f$ to the actual 
field 

\[ f = \Delta f + B. \]

We also benefit from the fact that (because of $B = \text{const.}$) it holds 

\[ \frac{\partial}{\partial z} (\Delta f + B) = \frac{\partial}{\partial z} \Delta f = \frac{\partial}{\partial z} f. \]

In consequence, (9) becomes 

\[ (x - x_0) \frac{\partial f}{\partial x} + (y - y_0) \frac{\partial f}{\partial y} + (z - z_0) \frac{\partial f}{\partial z} = n (\Delta f + B). \tag{10} \]

To insert the gravity gradients, we set e.g. $f = \delta V_z$, $\frac{\partial f}{\partial x} = \delta V_{xz}$, $\frac{\partial f}{\partial y} = \delta V_{yz}$ etc. 
and get 

\[ (x - x_0)\, \delta V_{xz} + (y - y_0)\, \delta V_{yz} + (z - z_0)\, \delta V_{zz} = (n - 1)(\Delta \delta V_z + B_z). \tag{11} \]

The same procedure can be applied to the other gravity vector and tensor components. 
Hence, it is possible to use the complete gravity gradient tensor to determine 
the source of the gravity anomaly. However, this example sticks to the $z$-components 
of the gravity gradient tensor as done in (11). 

Fig. 1 Example of the gravity gradient tensor components $T_{ij}$ of a point mass [9] 

In a next step, the observation data usually are interpolated to regularly 
spaced points on either lines or planar grids. A window of size $m$ slides over the 
interpolated data and for each window a solution for the unknown parameters 
is found by a least squares estimation (Gauss-Markov model). However, if the 
measurement positions are uniformly distributed, this interpolation step might not 
be necessary. A reformulation of (11) for $m$ sample points (setting $N = -(n - 1)$) 
yields 

\[ \begin{pmatrix}
\delta V_{xz,1} & \delta V_{yz,1} & \delta V_{zz,1} & -N \\
\vdots & \vdots & \vdots & \vdots \\
\delta V_{xz,m} & \delta V_{yz,m} & \delta V_{zz,m} & -N
\end{pmatrix}
\begin{pmatrix} x_0 \\ y_0 \\ z_0 \\ B_z \end{pmatrix}
=
\begin{pmatrix}
x_1\, \delta V_{xz,1} + y_1\, \delta V_{yz,1} + z_1\, \delta V_{zz,1} + N\, \Delta\delta V_{z,1} \\
\vdots \\
x_m\, \delta V_{xz,m} + y_m\, \delta V_{yz,m} + z_m\, \delta V_{zz,m} + N\, \Delta\delta V_{z,m}
\end{pmatrix} \tag{12} \]

[6]. $N$ is the so-called structural index and is also unknown per se (but needs to be 
chosen in the process of Euler deconvolution, e.g. $N = 2$ for point masses). 

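Now (12) is of the form 

\[ \boldsymbol{\eta} = A \boldsymbol{\xi} + \mathbf{e}, \]

with the vector of “observations” $\boldsymbol{\eta}$ being the right hand side of (12), design 
matrix $A$, vector of parameters $\boldsymbol{\xi}$ and the residual vector $\mathbf{e}$ which holds model and 
measurement errors. 

We need to choose an adequate window size so that the window contains enough 
measurement positions and, in consequence, the problem becomes overdetermined. 
Then we are able to treat it by the least squares method to obtain the estimated 
parameters $\hat{\boldsymbol{\xi}}$: 

\[ \boldsymbol{\eta} = A \boldsymbol{\xi} + \mathbf{e} \;\Rightarrow\; A^T \boldsymbol{\eta} = A^T A\, \hat{\boldsymbol{\xi}} \;\Rightarrow\; \hat{\boldsymbol{\xi}} = (A^T A)^{-1} A^T \boldsymbol{\eta}. \]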

In this way, for each window position one estimate of the source’s location is 
computed. Those estimated locations tend to cluster in zones of contrast in a field, 
hence might be of geological or geophysical interest [1]. Choosing the right window 
size is not straightforward. On the one hand, it has to be large enough that the gravitational 
effect of a single source is covered; on the other hand, it should be small enough 
that significant effects from multiple sources are not included [11]. A synthetic 
example of Euler deconvolution of multiple sources is given in Fig. 2 for the 2D 
case. 

Fig. 2 Synthetic example of Euler deconvolution, 2D (along-track), multiple sources [9]. (a) $\delta V_z$ 
(blue), window size (red); (b) $\delta V_{xz}$ (black), $\delta V_{zz}$ (red); (c) source (green circle) and solutions of 
Euler deconvolution (blue); (d) counting the number of solutions per grid cell (darker = more); 
(e) sum of reciprocal distances to the other solutions (blue circles) 
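
The least squares step for a single window can be sketched as follows. The example builds the design matrix and the observation vector of (12) for noise-free synthetic data of one point mass and solves the normal equations. It is a minimal illustration under stated assumptions, not the authors' implementation: the unit system ($G\,\delta M = 1$), the window geometry, the source position and all function names are hypothetical, and the synthetic field value is used directly as the observed $z$-component.

```python
def gradients(p, src, gm=1.0):
    """V_z, V_xz, V_yz, V_zz of a point mass with dV = -gm/r (sign as in (3))."""
    dx, dy, dz = p[0] - src[0], p[1] - src[1], p[2] - src[2]
    r2 = dx * dx + dy * dy + dz * dz
    r3, r5 = r2 ** 1.5, r2 ** 2.5
    return (gm * dz / r3,
            gm * -3.0 * dx * dz / r5,
            gm * -3.0 * dy * dz / r5,
            gm * (-3.0 * dz * dz + r2) / r5)


def solve(mat, rhs):
    """Solve a dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(mat, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for k in range(col + 1, n):
            fac = aug[k][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[k][c] -= fac * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def euler_window(points, data, N=2.0):
    """Estimate (x0, y0, z0, B_z) for one window via the normal equations of (12)."""
    A, eta = [], []
    for (x, y, z), (vz, vxz, vyz, vzz) in zip(points, data):
        A.append([vxz, vyz, vzz, -N])
        eta.append(x * vxz + y * vyz + z * vzz + N * vz)
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(4)] for i in range(4)]
    Aty = [sum(row[i] * e for row, e in zip(A, eta)) for i in range(4)]
    return solve(AtA, Aty)


source = (0.3, -0.2, -1.0)  # hypothetical source below the window
window = [(0.5 * i, 0.5 * j, 0.0) for i in range(-2, 3) for j in range(-2, 3)]
data = [gradients(p, source) for p in window]
estimate = euler_window(window, data)  # [x0, y0, z0, B_z]
```

With error-free data of a single point mass and $N = 2$, the estimate reproduces the source position and a vanishing base field. With real data, the quality of the estimate depends on the window size discussed above.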


2.4 Extended Euler Deconvolution 

The standard Euler deconvolution, as presented above, disregards the fact that the 
gravity gradients are measured quantities. Measured quantities are never error free. 
As the design matrix consists mainly of such measured quantities, the included 
errors lead to an inaccurate estimation of the parameters. In such a case the Gauss-Helmert 
model is the better choice in terms of estimation precision. In this model 
the errors of the measured quantities can also be considered. However, the drawback is 
that matrices get several times bigger due to the additional conditions. We also want 
to retrieve variance-covariance information, hence we need one inversion of a matrix 
per window. For the standard Euler deconvolution the size of that matrix depends 
only on the number of unknown parameters, i.e. the matrix size is $4 \times 4$. The matrix size 
of the extended Euler deconvolution depends also on $m$ (the number of observations 
in each window), i.e. the matrix size increases to $(4m + 4) \times (4m + 4)$. This increases 
the computing time by a factor of 60. 

Measured (and as such stochastic) quantities are $x, y, z, \delta V_{ij}, \Delta\delta V_i$; unknown 
quantities are $x_0, y_0, z_0, B_j$ ($i, j = x, y, z$). As an example, out of the three possible 
equations, let us examine (11), while taking into account the stochastic quantities 
(e.g. $x$ becomes now $x + e_x$): 

\[ (x + e_x - x_0)(\delta V_{xz} + e_{\delta V_{xz}}) + (y + e_y - y_0)(\delta V_{yz} + e_{\delta V_{yz}})
 + (z + e_z - z_0)(\delta V_{zz} + e_{\delta V_{zz}}) + N (\Delta\delta V_z + e_{\Delta\delta V_z} + B_z) = 0. \tag{13} \]

Equation (13) is not linear anymore, hence linearization becomes necessary. 
Additionally, we can assume that the errors of $(x, y, z)$ are very small (the 
position of GOCE is known from GPS positioning with high accuracy, i.e. we set 
$e_x = e_y = e_z = 0$). Rewritten in matrix-vector notation, we get 



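\[ \begin{pmatrix}
 -(\delta V_{xz} + \mathring{e}_{\delta V_{xz}}) & -(\delta V_{yz} + \mathring{e}_{\delta V_{yz}}) & -(\delta V_{zz} + \mathring{e}_{\delta V_{zz}}) & N
 \end{pmatrix}
 \begin{pmatrix} \Delta x_0 \\ \Delta y_0 \\ \Delta z_0 \\ \Delta B_z \end{pmatrix}
 + \begin{pmatrix}
 (x + \mathring{e}_x - \mathring{x}_0) & (y + \mathring{e}_y - \mathring{y}_0) & (z + \mathring{e}_z - \mathring{z}_0) & N
 \end{pmatrix}
 \begin{pmatrix} e_{\delta V_{xz}} \\ e_{\delta V_{yz}} \\ e_{\delta V_{zz}} \\ e_{\Delta\delta V_z} \end{pmatrix} \]
\[ {} + (x - \mathring{x}_0)\, \delta V_{xz} + (y - \mathring{y}_0)\, \delta V_{yz} + (z - \mathring{z}_0)\, \delta V_{zz} + N (\Delta\delta V_z + \mathring{B}_z) = 0. \tag{14} \]

Here, $\circ$ on top of a variable indicates that this variable is evaluated at the Taylor 
point. Equation (14) is nearly in the form of the Gauss-Helmert model, so we can 
write 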

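\[ A\, \Delta\boldsymbol{\xi} + B^T \mathbf{e} + \mathbf{w} = \mathbf{0} \tag{15} \]

with 

\[ \underset{m \times 4m}{B^T} = \left( \begin{array}{*{16}{c}}
\mathring{X}_1 & 0 & \cdots & 0 & \mathring{Y}_1 & 0 & \cdots & 0 & \mathring{Z}_1 & 0 & \cdots & 0 & N & 0 & \cdots & 0 \\
0 & \mathring{X}_2 & & \vdots & 0 & \mathring{Y}_2 & & \vdots & 0 & \mathring{Z}_2 & & \vdots & 0 & N & & \vdots \\
\vdots & & \ddots & 0 & \vdots & & \ddots & 0 & \vdots & & \ddots & 0 & \vdots & & \ddots & 0 \\
0 & \cdots & 0 & \mathring{X}_m & 0 & \cdots & 0 & \mathring{Y}_m & 0 & \cdots & 0 & \mathring{Z}_m & 0 & \cdots & 0 & N
\end{array} \right), \]

where $\mathring{X}_i = x_i + \mathring{e}_{x,i} - \mathring{x}_{0,i}$, $\mathring{Y}_i = y_i + \mathring{e}_{y,i} - \mathring{y}_{0,i}$ and $\mathring{Z}_i = z_i + \mathring{e}_{z,i} - \mathring{z}_{0,i}$, $i = 1, \ldots, m$, 
and 

\[ \underset{4m \times 1}{\mathbf{e}}{}^{T} = \left[\, e_{\delta V_{xz},1} \cdots e_{\delta V_{xz},m} \;\; e_{\delta V_{yz},1} \cdots e_{\delta V_{yz},m} \;\; e_{\delta V_{zz},1} \cdots e_{\delta V_{zz},m} \;\; e_{\Delta\delta V_z,1} \cdots e_{\Delta\delta V_z,m} \,\right]. \]

Matrix $A$ (of size $m \times 4$) and vector $\mathbf{w}$ (of length $m$) can be derived directly from (14). 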
Minimizing the Lagrange function

\[
\frac{1}{2}\, e^T e + \lambda^T (A\,\Delta\xi + B^T e + w) \;\rightarrow\; \min_{\Delta\xi,\, e,\, \lambda}
\]

leads us to the linear system of equations

\[
\begin{bmatrix} 0 & 0 & A^T \\ 0 & I & B \\ A & B^T & 0 \end{bmatrix}
\begin{bmatrix} \Delta\xi \\ e \\ \lambda \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ -w \end{bmatrix} \tag{16}
\]

whose solution is refined iteratively until the vector of increments Δξ becomes
small enough to meet the accuracy threshold. The initial values of the variables
evaluated at the Taylor point are obtained by a preceding standard Euler deconvolution.
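
The step from the minimized function to system (16) can be made explicit. The following
short derivation is only a sketch: it assumes the usual stationarity conditions of the
Gauss-Helmert model and uses no symbols beyond those of (15) and (16).

```latex
\begin{align*}
\frac{\partial}{\partial \Delta\xi}\Big[\tfrac{1}{2} e^T e + \lambda^T (A\,\Delta\xi + B^T e + w)\Big]
  &= A^T \lambda = 0, \\
\frac{\partial}{\partial e}\Big[\tfrac{1}{2} e^T e + \lambda^T (A\,\Delta\xi + B^T e + w)\Big]
  &= e + B \lambda = 0, \\
\frac{\partial}{\partial \lambda}\Big[\tfrac{1}{2} e^T e + \lambda^T (A\,\Delta\xi + B^T e + w)\Big]
  &= A\,\Delta\xi + B^T e + w = 0.
\end{align*}
```

Stacking these three conditions row by row, with the unknowns ordered as (Δξ, e, λ) and
w moved to the right-hand side, reproduces the block structure of (16).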

As a preliminary result, Fig. 3 displays solutions of Euler deconvolution for the first
2 months of GOCE data over Africa. At this early stage, we assumed to
detect point masses, i.e. N = 2, knowing well that the structures are not that simple
in reality. At the moment we cannot explain the depth estimates of the
Euler deconvolution solutions, which lie roughly at a distance of 4430 km from the
geocentre.


3 Windowing of the Data 

We already mentioned earlier that Euler deconvolution is achieved by sliding a
window over the data and estimating one set of parameters per window. However,
the measurement data are ordered chronologically, i.e. we get the data along the
satellite orbit. Consequently, we have to project a grid on those data to get the data
windows for Euler deconvolution.

The simplest approach would be to divide the data according to Earth’s already
present latitude-longitude grid. This has the disadvantage that, depending on
latitude, the windows do not cover equally sized areas. Hence we follow the
more sophisticated approach of a geodesic sphere (cf. Fig. 4).

Starting with an icosahedron whose top and bottom vertices coincide with the
poles, we first sort the data to its 12 faces. Afterward, we divide each face into
four new ones whose vertices are projected on the surface of the sphere. Again,
we sort the data of the face to the four new faces. The last two steps are repeated
until we reach the desired resolution. Hierarchical sorting speeds up the program
enormously. For example, instead of checking all 12 · 4^4 = 3072 faces of a plain “step
4” geodesic sphere to find where an entry belongs (worst case scenario), for hierarchical
sorting we need to check in total only 12 + 4 · 4 = 28 faces, but at different step
levels.

Fig. 3 Euler deconvolution of GOCE data of the first 2 months over Africa; interpretation
is still necessary. Top: latitude-longitude grid (with heavy aliasing), bottom: geodesic grid
(with rather random errors)

Fig. 4 Different steps of a geodesic sphere, starting with an icosahedron (step 0) on the left and
quadrupling the number of faces in each next step
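
The face counts quoted for hierarchical sorting follow from two small formulas. The sketch
below only reproduces that arithmetic; the function names are hypothetical, and the base
count of 12 faces and the step level 4 are taken from the text.

```python
def flat_checks(base_faces: int, step: int) -> int:
    """Worst-case number of faces tested when searching the finest level directly."""
    # Every subdivision step quadruples the number of faces.
    return base_faces * 4 ** step


def hierarchical_checks(base_faces: int, step: int) -> int:
    """Worst-case number of faces tested when descending level by level."""
    # All base faces are tested once, then only the four children of the
    # matching face on each of the following levels.
    return base_faces + 4 * step


if __name__ == "__main__":
    print(flat_checks(12, 4))          # 3072
    print(hierarchical_checks(12, 4))  # 28
```

The hierarchical count grows linearly with the step level, whereas the flat count grows by
a factor of four per level.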



We wrote a C library, which is based on the ideas of [4] but can additionally
handle our data and the subdivisions of the icosahedron. The whole data structure is
realized by pointers, as the scheme in Fig. 5 illustrates. Using pointers for the data
gives us the benefit that the amount of copied data is much lower (e.g. 8 bytes for a
pointer in comparison to around 100 bytes per complete entry).


4 Discussion, Conclusion and Outlook 

We prepared the theoretical background of Euler deconvolution with the geodetic
enhancement of the Gauss-Helmert model. We also applied Euler deconvolution to
non-preprocessed GOCE data and found that the results are not yet interpretable.
Hence, further research is necessary for a better understanding of the data, as well
as a better interpretation and understanding of the results.

For our proposed “extended Euler deconvolution” via the Gauss-Helmert model,
computation time increases considerably: the observation equations become nonlinear,
hence they must be linearized, which in turn demands iterations to solve
for the unknown parameters. Additionally, for our approach a (4m + 4) x (4m + 4)
matrix has to be inverted (m being the number of observations within one window),
compared to the 4 x 4 matrix of the standard Gauss-Markov model approach.
The matrix inversion becomes necessary in order to retrieve variance-covariance
information. Altogether, this leads to a computation time increase by a factor of 60.

As the solutions for every Euler deconvolution window can be computed 
completely independently, we can parallelize our program and hence benefit from 
HPC. 

At the moment, we are preprocessing the GOCE data and at the same time 
preparing a first implementation of our program in C as proof of concept with 
synthetic data. 

The steps after this will be the parallelization of our program and its implementation
on the CRAY XE6.


Parameterization of Threshold Accepting:
The Case of Economic Capital Allocation


H.-P. Burghof and J. Müller


Abstract The corresponding model of economic capital allocation was described
as a mixed integer nonlinear program (MINLP) in Burghof and Müller (Allocation
of economic capital in banking: a simulation approach. In: Nagel WE, Kröner DB,
Resch M (eds) High performance computing in science and engineering. Springer,
2011). In this context an appropriate solving algorithm in the form of threshold accepting
was introduced. Now we address the algorithm’s parameterization, which is
influenced by the model’s specific implementation of threshold accepting. Without
discussing the implementation in detail, an impression of the parameterization’s
handling and robustness in context with the model is given. On the basis of adequate
parameterization the model provides a first indication of optimal economic capital
allocation’s superiority compared to alternative allocation methods.


1 Introduction 

The model addresses the allocation process of economic capital in banks. Financial 
institutions hold economic capital in order to cushion unexpected losses and to 
prevent bankruptcy. Thereby, the capital at the same time mitigates the risk of 
contagion in the financial system. 

The corresponding model of economic capital allocation was described as a
mixed integer nonlinear program (MINLP) in [1]. In this context an appropriate
solving algorithm in the form of threshold accepting was introduced. Now we address
the algorithm’s parameterization, which is influenced by the model’s specific implementation
of threshold accepting. Without discussing the implementation in detail,
an impression of the parameterization’s handling and robustness in context with
the model is given. On the basis of adequate parameterization the model provides
a first indication of optimal economic capital allocation’s superiority compared to
alternative allocation methods.


H.-P. Burghof (✉) • J. Müller

Lehrstuhl für Bankwirtschaft und Finanzdienstleistungen, Universität Hohenheim, 70599
Stuttgart, Germany

e-mail: Burghof@uni-hohenheim.de; Jan.Mueller@uni-hohenheim.de

W.E. Nagel et al. (eds.), High Performance Computing in Science and Engineering ’12,
DOI 10.1007/978-3-642-33374-3-37, © Springer-Verlag Berlin Heidelberg 2013

Before concentrating on parameterization issues, we classify the underlying
optimization problem. In this context we give an exemplary visual proof of the
non-convexity of the model cases’ solution spaces. Furthermore, important parameters
of threshold accepting are introduced. After describing the parameterization and the
indication of the optimal allocation’s superiority to alternative allocation methods,
the last part provides technical information on typical computations occurring in
context with current and future analyses on the basis of the model.


2 Classification of the Underlying Optimization Problem 

The model’s underlying optimization problem belongs to the class of MINLPs.
Similar to portfolio optimization, the present optimization of economic capital
allocation has to face non-linearity stemming from diversification effects among the
securities’ returns. Furthermore, the model considers influences
originating from the decision makers’ actions. Binary variables control value at
risk (VAR) constraints.¹ These integer constraints cause the non-convexity which
represents the actual challenge of the present optimization.

Fig. 1 gives an exemplary visual impression of the model’s solution space surface
on the basis of three VAR-limits and decision makers², respectively. The
x- and y-axis exhibit the controlled limit quantities vl_1 and vl_2 of business
units one and two, building a 142/120-grid by increasing the limits stepwise
by 1.57 EUR. In contrast, the size of the third unit’s limit vl_3 is adjusted in order to
maximize the expected profit μ_bank(vl).

The example uses a total VAR-limit of vl_bank = ec = 1,000 EUR, which
corresponds at the same time to the bank’s total amount of economic capital
available. The only constraints are therefore vl_bank > vl_i and the budget of
investment capital c_bank = 150,000 EUR.³ Under these conditions, assigning the VAR-limits
vl_1 = 546.1 EUR, vl_2 = 393.4 EUR and vl_3 = 615.9 EUR to the units induces
the bank’s maximum expected profit of μ_bank(vl) = 55.4 EUR.

In the example, probabilities of success of p_1 ≈ 0.55, p_2 ≈ 0.54 and p_3 ≈
0.56 characterize the business units. Furthermore, the geometric Brownian motion
(GBM) provides the multivariate returns used for modeling each business unit’s
traded security.⁴ The corresponding standard deviations and correlations are based on
the return samples of three arbitrarily chosen S&P 500 stocks.⁵ The business units’
and securities’ characteristics enter the Monte Carlo trading simulation with 20,000
iterations, which generates the bank’s profit and loss distribution L_bank(vl) required
for the valuation of each occurring limit allocation.

Fig. 1 Extract from an exemplary solution space surface generated by the model

1 See the binary variables in context with the description of the model’s MINLP in [1],
pp. 542-543.

2 The terms “decision maker” and “business unit” are used synonymously in the following.

3 See Burghof and Müller [1], p. 542 for a notation of the used constraints.

The surface of the resulting solution space exhibits many local extreme points.
For finer grid structures, the texture of the surface will change slightly but
basically keep its many peaks and valleys. Pure gradient-based optimization methods,
which primarily follow the steepest ascent in their vicinity, will
inevitably get stuck in the nearest extreme point. In order to solve such
global problems, an algorithm has to be able to escape from extreme points which
do not represent the global optimum. Furthermore, it has to feature probabilistic
procedures in order to achieve a reasonable coverage of the entire solution space.
Therefore, closed-form solutions have to be combined with heuristics, or even pure
heuristic methods have to be applied. One appropriate heuristic method to solve the
present financial optimization problem is threshold accepting.⁶


4 However, the implementation of the model does not require any assumptions concerning the
distribution of the returns. The model enables the use of historical data in combination with
bootstrapping methods. It considers skewness and kurtosis if existent in the data.

5 We use the stocks MMM, ABT and ANF with returns from August 9, 2010 to August 5, 2011.

6 See e.g. [5-7] for introductions to threshold accepting and issues on heuristic optimization in
finance in general.

    parameterization
    for i = 1 to n_restarts do
        random current solution vl_c, vl = (vl_1, ..., vl_nunits)
        for j = 1 to n_rounds do
            for k = 1 to n_steps do
                generate vl_n ∈ N(vl_c) and compute Δ = μ_bank(vl_n) − μ_bank(vl_c)
                if Δ > t_j then vl_c = vl_n
            end for
        end for
        μ_bank,i = μ_bank(vl_c) and vl_i = vl_c
    end for
    vl* = vl_i with μ_bank,i = max{μ_bank,1, ..., μ_bank,n_restarts}

Fig. 2 Threshold accepting pseudo code with key parameters


3 Key Parameters of Threshold Accepting 

The functional principle of threshold accepting was already described in [1]
on the basis of portfolio optimization under a VAR-constraint. Except for its
final implementation, its basic principle stays the same for the optimization of
economic capital allocation. The pseudo code gives an overview of the different
parameters and parts of threshold accepting requiring specific settings (Fig. 2).⁷
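
The loop structure of the pseudo code can be sketched in a few lines of Python. This is
a minimal illustration, not the authors' implementation: the callables `objective`,
`random_solution` and `neighbor`, the toy objective in the demo and the numeric threshold
values are all assumptions. Thresholds are stored here as non-positive numbers so that the
acceptance test reads "delta > t" as in Fig. 2.

```python
import random


def threshold_accepting(objective, random_solution, neighbor, thresholds,
                        n_restarts, n_steps, rng):
    """Maximize `objective` with the restart/round/step loops of the pseudo code."""
    best_solution, best_value = None, float("-inf")
    for _ in range(n_restarts):
        current = random_solution(rng)
        current_value = objective(current)
        for t in thresholds:                  # one round per threshold
            for _ in range(n_steps):
                candidate = neighbor(current, rng)
                candidate_value = objective(candidate)
                delta = candidate_value - current_value
                if delta > t:                 # t <= 0: small deteriorations pass
                    current, current_value = candidate, candidate_value
        if current_value > best_value:        # keep the best restart
            best_solution, best_value = current, current_value
    return best_solution, best_value


if __name__ == "__main__":
    # Hypothetical one-dimensional toy problem with its maximum at x = 3.
    solution, value = threshold_accepting(
        objective=lambda x: -(x - 3.0) ** 2,
        random_solution=lambda r: r.uniform(-10.0, 10.0),
        neighbor=lambda x, r: x + r.uniform(-0.5, 0.5),
        thresholds=[-1.0, -0.5, -0.1, 0.0],
        n_restarts=5,
        n_steps=200,
        rng=random.Random(0),
    )
    print(solution, value)
```

Because each restart is independent of the others, the outer loop is the natural unit for
distributing work across processors.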

The following have to be determined: the number of restarts n_restarts, the number of rounds
n_rounds responsible for the number of used thresholds, the number of optimization
steps n_steps, the neighborhood function N(vl_c) and the threshold sequence t. Since
the last two are less intuitive, we introduce them in more detail in the following. Due
to differences between the implementation of threshold accepting in context with
portfolio optimization under a VAR constraint and economic capital allocation, we
finally point out the differences between these two fields of application of threshold
accepting in finance.

The neighborhood function N(vl_c) defines how a new solution vl_n is generated
from the current solution vl_c.⁸ In the model, an amount of one of two randomly
chosen variables, here VAR-limits, is transferred to the other. The resulting solution
is adjusted if constraint violations occur. Alternatively, penalty terms could enforce
adherence to the constraints.⁹ Unnecessarily strong modification of the current
solution vl_c hinders the successive development of the solutions across the iterations
and should therefore be prevented.

7 See also e.g. [4], p. 9 for threshold accepting pseudo code.

8 See e.g. [4], pp. 8-9 concerning neighborhood function issues.

9 See [5], p. 22 for remarks on the usage of penalty terms.

However, the possible implementations of the neighborhood function N(vl_c)
are various, since the implementation depends on the context of the application
of threshold accepting. The case of portfolio optimization under a downside risk
constraint with the focus on further integer constraints, for example, requires the
transfer of a constant percentage value.¹⁰ In contrast, investigations with the present
model clearly suggest a non-percentage transfer value drawn from a particular
interval. We successively decrease this interval exponentially to zero across the
iterations. Tests with percentage values remained unsatisfying. In order to ensure
the quality and the efficiency of threshold accepting, we jointly calibrate the upper
bound q of the sequence’s initial interval with the quantile p determining the starting
point of the threshold sequence t, which we introduce in Sect. 4.
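
A neighborhood function of this kind could look as follows. The sketch rests on several
assumptions that are not taken from the model: the decay constant `decay`, the clipping
rule used to repair constraint violations, and the per-unit bound `max_limit` are all
hypothetical. An exponential schedule also only approaches zero rather than reaching it.

```python
import math
import random


def transfer_bound(q, iteration, n_iters, decay=5.0):
    """Upper bound of the transfer interval, shrinking exponentially from q."""
    # `decay` is an assumed constant; the bound approaches zero but never reaches it.
    return q * math.exp(-decay * iteration / n_iters)


def neighbor(limits, bound, max_limit, rng):
    """Move an amount drawn from [0, bound] from one randomly chosen limit to another."""
    giver, taker = rng.sample(range(len(limits)), 2)
    amount = rng.uniform(0.0, bound)
    # Assumed repair step: never push the giver below zero or the taker above max_limit.
    amount = min(amount, limits[giver], max_limit - limits[taker])
    new_limits = list(limits)
    new_limits[giver] -= amount
    new_limits[taker] += amount
    return new_limits


if __name__ == "__main__":
    rng = random.Random(1)
    limits = [546.1, 393.4, 615.9]
    for k in (0, 2500, 5000):
        print(k, neighbor(limits, transfer_bound(750.0, k, 5000), 1000.0, rng))
```

Only two limits change per call, which keeps successive solutions close to each other in
the sense described above.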

The thresholds t determine whether a current interim solution represents an
appropriate basis for the generation of a new solution by the neighborhood
function N(vl_c).¹¹ The data for the threshold sequence t is derived from a randomly
initialized sequence of neighborhood solutions and the absolute values of the deltas
of their valuations. Subsequently we choose an interval up to a certain quantile
p from the empirical distribution of the resulting values. The final n_rounds thresholds
represent equidistant draws from the interval.¹² The corresponding procedure has
the advantage of being data driven. The thresholds are therefore oriented towards
absolute deltas likely to occur in context with the current neighborhood function
implementation, which improves the convergence of threshold accepting.
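
The data-driven construction of the thresholds can be sketched as below. The sampling of
independent random solutions, the simple order-statistic rule for the quantile and the
parameter `n_samples` are assumptions for illustration. Thresholds are returned as
non-positive numbers ending at zero, so that a move is accepted whenever its delta
exceeds the current threshold.

```python
import random


def threshold_sequence(objective, random_solution, neighbor, n_rounds, p,
                       n_samples, rng):
    """Equidistant thresholds from the negated p-quantile of absolute deltas up to zero."""
    deltas = []
    for _ in range(n_samples):
        current = random_solution(rng)
        candidate = neighbor(current, rng)
        deltas.append(abs(objective(candidate) - objective(current)))
    deltas.sort()
    # Empirical p-quantile via a simple order-statistic rule (an assumption).
    upper = deltas[min(int(p * n_samples), n_samples - 1)]
    if n_rounds == 1:
        return [0.0]
    # Round j gets -upper * (n_rounds - 1 - j) / (n_rounds - 1); the last one is zero.
    return [upper * (j - (n_rounds - 1)) / (n_rounds - 1) for j in range(n_rounds)]


if __name__ == "__main__":
    ts = threshold_sequence(
        objective=lambda x: -(x - 3.0) ** 2,      # hypothetical toy objective
        random_solution=lambda r: r.uniform(-10.0, 10.0),
        neighbor=lambda x, r: x + r.uniform(-0.5, 0.5),
        n_rounds=10, p=0.5, n_samples=200, rng=random.Random(2),
    )
    print(ts)
```

A smaller p yields a tighter interval and therefore a stricter acceptance rule from the
first round on.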

As already mentioned, an appropriate implementation of the neighborhood function
strongly depends on the exact underlying application of threshold accepting.
There are fundamental differences between the application in context with the
aforementioned case of portfolio optimization under a VAR constraint and the current
optimal economic capital allocation.

The portfolio optimization generates long and short position weights which 
are consequently restricted to 100%. In contrast, the economic capital allocation 
generates VAR-limits allowing decision makers to build long and short positions 
autonomously under the condition of full limit usage. Thereby, the exact budget 
these limits can be generated from is unknown until the optimization ends. 

Finally, a financial institution could be interpreted as one huge
portfolio in order to apply the common portfolio optimization under
a downside risk constraint to economic capital allocation issues. Indeed,
a position weight from portfolio optimization can at first easily be transformed into its
corresponding VAR-limit. However, an autonomous decision maker provided with
the limit would only be able to build the original position the VAR-limit was
derived from. Any variation potentially causes violations of the total risk limit of the
institution. Furthermore, variations are very likely to prevent the realization of
the highest profit expectations originally worked out by the portfolio optimization.
These shortcomings of the VAR-limits from portfolio optimization originate from
their lack of consideration of the instability of correlations resulting from the autonomous
decision making of the business units.


10 See [5], p. 21 suggesting a percentage transfer value of 0.2 %.

11 See [4], p. 9 introducing a data-driven threshold sequence.

12 Thereby, it might be advisable to choose the threshold of the last round to be equal to zero.

Furthermore, implementation differences partly originate from the fact that the
current model does not focus on further integer constraints since e.g. infinite
divisibility of stocks is assumed. However, focusing on more realistic integer
constraints with the current model would distract attention from the actual objective
of investigating optimal economic capital allocation itself.


4 Exemplary Case of Parameterization 

As an example, the case from the surface diagram displayed in Fig. 1 is reused.
Since the parameters of threshold accepting depend on each other, the determination
of optimal values can be difficult. However, there are findings that the accuracy of
threshold accepting is relatively robust to the variation of at least
some of them. In order to reduce the complexity, simple grid searches concerning
certain pairs of parameters are performed. One pair consists of n_restarts and n_steps,
the other of the neighborhood function’s q and the threshold sequence’s p. Among these
pairs, the first one offers a safe opportunity to at first set its values just high
enough to prevent hindering the optimal parameterization of the other pair. If
the underlying optimization problem exceeds the available computational resources
from the start, the highest possible levels for n_restarts and n_steps are recommended.

In order to get an idea of which values should be set preliminarily for n_restarts and
n_steps, a rough grid search can be performed to narrow down their possible preliminary
parameterizations. During this search we keep the neighborhood function’s
q and the threshold sequence’s p unmodified. They should be initialized with
moderate values from the middle of the corresponding intervals in order to prevent
a completely inappropriate parameterization. The description of their potential intervals
below will give an idea of how to set reasonable dummy values. Furthermore, we
keep n_rounds = 10, since the relevant literature describes the parameter’s variation
as less decisive for the quality and the efficiency level of threshold accepting if
n_iters = n_steps x n_rounds stays within a certain range.¹³

Results from the rough grid search for setting preliminary values for n_restarts and
n_steps are displayed in Fig. 3. We find that n_restarts = 100 most certainly represents a
sufficiently high number, since optimizations using only n_restarts = 50 already achieve no
worse results. Additionally, the variation of n_steps hardly influences the quality of the
maximum values found for μ_bank. However, the present extreme case of a very small
model bank with only 3 business units almost consistently exhibits higher results for
the lower n_steps = 500. There are only few restarts with n_steps = 1,000 achieving
very good results and closing up to the results of n_steps = 500. The following test
reveals that a higher number of business units leads to the expected superiority of
higher iteration rates.


13 See [6] for remarks on the required number of rounds n_rounds.

Fig. 3 Expected profits μ_bank(vl_i), i = 1, ..., n_restarts, ordered by size for a model bank
with 3 business units

Fig. 4 Expected profits μ_bank(vl_i), i = 1, ..., n_restarts, ordered by size for a model bank
with 50 business units


The case with 50 business units from Fig. 4 confirms the generally expected
higher performance of high values for n_iters and n_steps, respectively. Furthermore,
the results fit and confirm the conclusion that 100 restarts can be assumed to be
more than sufficient to enable finding the best solutions. This is because, again, already
with n_restarts = 50 the best solution lies only marginally below the best one from
the case where n_restarts = 100 is used. However, since the focus is on the special
case from Fig. 3 with only 3 business units, the data suggests setting the preliminary
values n_restarts = 100 and n_steps = 500. A certain verification could be done by
performing the grid search once more with moderately changed values
for q and p.

After laying these foundations, the parameterization concerning the neighborhood
function N(vl_c) and the threshold sequence t can now be performed by setting
the corresponding initialization values q and p. In case of the threshold sequence,
the initialization value p represents a quantile and is therefore taken from the interval
[0, 1]. In contrast, as mentioned above, the determination of neighborhood solutions
is based on a sequence of absolute transfer values randomly drawn from a particular
exponentially decreased interval. The initialization value q in this case denotes the
upper bound of the corresponding sequence’s first interval, whereas zero denotes
the lower bound. The interval [0, maximum VAR-limit] defines the potential upper
bounds.¹⁴ The maximum VAR-limit is the maximum limit allowed to be assigned to
a single business unit and is therefore at the same time the maximum transfer value
possibly occurring. In the present example 1,000 EUR determines the maximum
VAR-limit, meaning that there is virtually no maximum VAR-limit since this amount
corresponds to vl_bank. In cases where each business unit is provided with a different
maximum VAR-limit, the highest of the occurring limits consequently determines the
upper bound.

Figure 5 represents the findings of a 28/28-grid search on the described intervals
of potential values for q and p. Each grid point was computed using
n_restarts = 100, n_steps = 500 and n_rounds = 10. While Fig. 5 exhibits the best solutions
out of n_restarts, Fig. 6 is based on the corresponding average values. The latter diagram
is particularly helpful for the identification of the efficient parameter combinations
which are promising concerning reductions of n_restarts, n_steps and therefore runtime
without (too high) losses in solution quality.
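
The size of this grid search can be made concrete with a short sketch. Evenly spaced grid
points including both interval ends are an assumption, as is the helper name
`parameter_grid`; the grid dimension, the upper bound of 1,000 EUR and n_restarts = 100
come from the text.

```python
from itertools import product


def parameter_grid(q_max, n_points):
    """All (q, p) pairs on an n_points x n_points grid over [0, q_max] x [0, 1]."""
    # Evenly spaced values including both endpoints (an assumption).
    q_values = [q_max * i / (n_points - 1) for i in range(n_points)]
    p_values = [i / (n_points - 1) for i in range(n_points)]
    return list(product(q_values, p_values))


grid = parameter_grid(q_max=1000.0, n_points=28)
n_restarts = 100
jobs = len(grid) * n_restarts  # every restart of every grid point is one optimization run

if __name__ == "__main__":
    print(len(grid), jobs)  # 784 78400
```

The resulting 78,400 runs match the number of jobs listed for the threshold accepting
parameterization in Table 1.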

The analysis of both diagrams roughly suggests choosing initialization values
q and p corresponding to 500 EUR < q < 1,000 EUR and 0 < p < 1.¹⁵ With
n_restarts = 100 and n_steps = 500, finally any value combination is capable of finding
very good solutions, as displayed in Fig. 5. For efficiency reasons, however,
q > 500 EUR should be chosen, as suggested by Fig. 6. Combinations where q <
500 EUR can be expected to tolerate less reduction of n_restarts and n_steps until their
best solutions suffer a severe decline.

Let us assume the chosen parameters are q = 750 EUR and p = 0.5. Using
this parameterization, n_restarts, n_steps and the runtime could now be reduced
through a similar grid search in order to enhance efficiency. Since n_restarts = 50 and
n_steps = 500 already proved appropriate in Fig. 3, the investigation of lower
value combinations, e.g. on a 4/4-grid for a rough assessment, appears advisable.
In order to compare the performance of value combinations, the runtime now also
has to be measured. The findings concerning n_restarts and n_steps depend on the
number of used processors. Under the objectives of minimizing the overall runtime
while keeping the solution quality, the cluster size determines the
advantageousness of reducing n_restarts in favor of n_steps and vice versa.¹⁶

Fig. 5 Findings for μ_bank(vl) of the parameterization grid search for initialization values q and p
of neighborhood function N(vl_c) and threshold sequence t

Fig. 6 Averages of μ_bank(vl_i), i = 1, ..., n_restarts, of the parameterization grid search for
initialization values q and p of neighborhood function N(vl_c) and threshold sequence t

14 In case there is also a minimum VAR-limit, maximum minus minimum VAR-limit represents
this interval’s upper bound.

15 A very close look at the data actually suggests 0 < p < 0.5 to further enhance the probability of
finding the very best solutions. Nevertheless, such a high accuracy level is not required in context
with the present application of threshold accepting.

Furthermore, there might be the requirement for the parameterization to fit a
wider scope of possible model cases, e.g. different numbers of business units or
different constraints. However, the described parameterization procedure’s recommendations
for n_restarts and n_steps can only be expected to be robust to slight
variations from the model case that was originally parameterized.

Increasing the number of business units increases the number of possible solutions
exponentially and therefore increases the number of required restarts n_restarts
and/or steps n_steps. If 3 as well as 200 business units belong to the scope of
possible optimizations, a new parameterization procedure would have to be performed
(automatically) before every optimization. In this case the procedure would have to
be slimmed down so that its computational effort remains proportionate to that of the
actual optimization, without too much loss in the parameterization’s quality.

Alternatively, n_restarts and n_steps in particular could be chosen just high enough
so that the most complex as well as simpler model cases could be handled by the
same parameterization without concerns about unacceptably low-quality
outcomes. The latter is accompanied by noticeable losses in the computations’
overall efficiency but is safe concerning the solutions’ quality and convenient to
implement, provided sufficient computational resources are available.

In contrast to n_restarts and n_steps, we suspect that the recommendation
for q and p is quite robust to variation of the model case, especially
for p. The parameterization of q is expected to be at least robust if the interval for
potential q-values reflects the scope of application. If many different model cases
are to be processed using the same q-value, its parameterization
should consequently be oriented towards the highest maximum VAR-limit among the
occurring model cases.


5 Superiority of Optimal Economic Capital Allocation 

The superiority of the optimal economic capital allocation is analyzed by comparing
optimal allocation with two alternative allocation methods.¹⁷ We expect the performance
of all three tested methods to depend on the model bank’s level of diversification.
Therefore, we perform economic capital allocations for 10 different model banks
with 10, 20, ..., 100 business units. The parameterization consists of n_restarts =
100, n_steps = 1,000, n_rounds = 10, q = 750 EUR and p = 0.5, which proved its
appropriateness in the previous section.

16 See [6], pp. 32-33 addressing cluster size issues in context with threshold accepting.

17 We refrain from presenting more alternative methods here, to keep the remarks short.

Fig. 7 Indication of superiority of optimal economic capital allocation for differently diversified
model banks and number of business units respectively

For a description of the optimal allocation method we refer to [1]. The simpler
one of the alternative methods allocates the economic capital uniformly among the
business units, and exclusively among those exhibiting positive return expectations.
The more sophisticated alternative uses the units’ weighted return expectations in
order to derive limit assignments; this method also exclusively considers
units with positive return expectations.
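
The two alternative methods can be illustrated as follows. The sketch assumes that the
assigned limits sum to the amount being distributed and that "weighted" means
proportional to each unit's positive return expectation; both are interpretations for
illustration, and the function names and sample numbers are hypothetical.

```python
def uniform_allocation(expected_returns, total_limit):
    """Equal limits for all units with a positive return expectation, zero otherwise."""
    n_positive = sum(1 for r in expected_returns if r > 0)
    share = total_limit / n_positive if n_positive else 0.0
    return [share if r > 0 else 0.0 for r in expected_returns]


def weighted_allocation(expected_returns, total_limit):
    """Limits proportional to the positive return expectations (assumed weighting)."""
    total = sum(r for r in expected_returns if r > 0)
    if total == 0:
        return [0.0 for _ in expected_returns]
    return [total_limit * r / total if r > 0 else 0.0 for r in expected_returns]


if __name__ == "__main__":
    expectations = [0.02, -0.01, 0.03, 0.05]  # hypothetical unit expectations
    print(uniform_allocation(expectations, 1000.0))
    print(weighted_allocation(expectations, 1000.0))
```

Neither rule looks at correlations between the units, which is the information the
optimal allocation exploits.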

The total amounts of economic capital and investment capital remain constant
at ec_bank = vl_bank = 1,000 EUR and c_bank = 150,000 EUR for every method and
bank size to provide equal opportunities. The business units’ characteristics include
certain probabilities of success and traded securities’ returns similar to the case
from Fig. 1.¹⁸ These parameters enter a Monte Carlo trading simulation with 20,000
iterations.

While the left diagram of Fig. 7 displays the expected profits of the different 
methods and model banks, the right one plots the corresponding risk in form of value 
at risk (VAR). The results confirm the superiority of the comprehensive optimal 
allocation. Less sophisticated allocation methods are incapable of keeping up the 
exposure to risk across the stronger diversified banks entailing decreases in profit 
expectations. Across the displayed unrealistically low diversified banks, the optimal 
allocation only slightly outperforms the alternatives. 

The alternative method oriented towards the weighted business units’ profit expectations
at least clearly outperforms the method which just assigns uniform limits. Both
alternatives leave a noticeable part of the bank’s total VAR-limit vl_bank = 1,000 EUR
unused, which is inefficient.


6 Technical Information on Typical Computations 

Typical computations result particularly from analyses where model parameters (not
threshold accepting parameters) are varied in order to learn about the parameters’
impact on the optimality of the economic capital allocation. A small exemplary
case was presented in Sect. 5. Corresponding analyses require the
computation of a new optimal economic capital allocation per variation. As a result,
typical computations consist of sequences of optimizations. In cases where several
parameters are varied simultaneously, the number of required optimizations per
computation increases exponentially.

18 Exact characteristics of business units are irrelevant and therefore remain unlisted.

Table 1 Technical information on typical computations with the NEC Nehalem cluster

Computations         A (a)       B (b)
Model cases          10 (c)      1
n_restarts / Jobs    1,000       78,400
n_rounds             10          10
n_steps              1,000       500
p                    0.5         0.5
q                    750 EUR     750 EUR
Sequential           269.7 h     1,071 h
Parallel             8.7 h       4.2 h
Nodes                4           32
Processors           32          256
Servants             31          255

(a) Superiority of economic capital allocation
(b) Threshold accepting parameterization
(c) Ten different bank structures with 10-100 business units

For example, we analyze the impact on the superiority of optimal allocation 
for different levels of information of the model bank’s central management. The 
different levels result from varying the comprehensiveness of the management’s 
Bayesian learning concerning the characteristics of the decision makers. 19 A follow- 
on investigation, for example, additionally includes a further decision makers’ 
property of herding and entering informational cascades which requires a rerun of all 
computations. 20 Finally, there are computations in context with threshold accepting 
parameterization issues according to paragraph 4. Table 1 gives some technical 
information about exemplary computations. 
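
The run times in Table 1 translate directly into speedup and parallel efficiency. The following is a minimal sketch of that arithmetic in Python; the function name and the dictionary layout are illustrative and not part of the original implementation. The efficiency is computed once against the number of servants and once against the number of processors, on the assumption that the master does not process jobs itself.

```python
def speedup_and_efficiency(sequential_h, parallel_h, workers):
    """Return (speedup, efficiency) for a run distributed over `workers`."""
    speedup = sequential_h / parallel_h
    return speedup, speedup / workers


# Run times and resource counts copied from Table 1.
runs = {
    "A": {"sequential_h": 269.7, "parallel_h": 8.7, "servants": 31, "processors": 32},
    "B": {"sequential_h": 1071.0, "parallel_h": 4.2, "servants": 255, "processors": 256},
}

for name, run in runs.items():
    s, eff_servants = speedup_and_efficiency(
        run["sequential_h"], run["parallel_h"], run["servants"])
    _, eff_processors = speedup_and_efficiency(
        run["sequential_h"], run["parallel_h"], run["processors"])
    print(f"{name}: speedup {s:.1f}, "
          f"efficiency {eff_servants:.3f} per servant, "
          f"{eff_processors:.3f} per processor")
```

For both computations the speedup is close to the number of servants (about 31 and 255), so the quotient of sequential and parallel run time grows almost linearly with the number of servants.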

With the underlying implementation of threshold accepting, one job corresponds 
to one restart. This is the smallest possible job size that does not require 
inter-servant communication. Further decomposition has not been considered so far. 
Implementing a decomposition that requires inter-servant communication would be 
challenging and not necessarily advantageous because of the potential parallel 
overhead from increased communication and idle times. The master-and-servant 
structure used so far is a classical one, outlined in Fig. 8. 
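
Since one job corresponds to one restart, each servant executes a complete threshold accepting run on its own. The sketch below shows the generic structure of one such restart in Python, organised in rounds and steps as in Table 1. It is an illustration under assumptions: the toy objective, the neighbourhood move and the linearly decreasing threshold sequence are hypothetical stand-ins, the sketch minimises a function, and none of these details are taken from the authors’ C++ implementation.

```python
import random


def threshold_accepting(objective, neighbour, x0, thresholds, n_steps, rng):
    """Run one restart of threshold accepting (minimisation).

    thresholds: one threshold per round, usually decreasing.
    n_steps:    number of candidate moves per round.
    """
    current, f_current = x0, objective(x0)
    best, f_best = current, f_current
    for tau in thresholds:
        for _ in range(n_steps):
            candidate = neighbour(current, rng)
            f_candidate = objective(candidate)
            # Accept improvements and any deterioration not larger than tau.
            if f_candidate - f_current <= tau:
                current, f_current = candidate, f_candidate
                if f_current < f_best:
                    best, f_best = current, f_current
    return best, f_best


# Hypothetical toy problem (not the capital allocation model).
def toy_objective(x):
    return (x - 3.0) ** 2


def toy_neighbour(x, rng):
    return x + rng.uniform(-0.5, 0.5)


n_rounds, n_steps = 10, 1000  # counts as in column A of Table 1
thresholds = [1.0 - r / n_rounds for r in range(n_rounds)]  # 1.0, 0.9, ..., 0.1
best, f_best = threshold_accepting(
    toy_objective, toy_neighbour, 10.0, thresholds, n_steps, random.Random(0)
)
print(f"best x = {best:.3f}, objective = {f_best:.5f}")
```

Because restarts differ only in their random seed and share no state, they can be handed to servants as independent jobs without any inter-servant communication.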

The experienced parallel overhead in context with the current implementation 
is marginal and insignificant compared to the runtime reductions gained from 
parallelization. Dynamic load balancing minimizes idle times through allocating 
the jobs efficiently among servants by sending a new job as soon as a servant has 


19 See [2], pp. 205-210 for the modeling of a Bayesian learning central planner introduced in 
context with a preliminary stage of the current model. 

20 See [3] for issues on herding and informational cascades in context with economic capital 
allocation. 
Fig. 8 Applied classical master-and-servant structure for parallel computing 


finished one. We use synchronized communication between master and servants 
through blocking send and receive. The servants pass their output data to the master, 
which selects and rearranges it immediately. As soon as the master has received 
the data of the last active servant, it starts certain follow-on analyses before the 
computation ends. The model is written in C++, whereas the parallel computing 
parts are implemented on the basis of the Intel message passing interface (IMPI). 
The Intel compiler is used for compilation. 
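
The effect of dynamic load balancing can be illustrated with a small scheduling simulation. The Python sketch below is not the MPI implementation described above; it only models the dispatch rule "send the next job to the servant that becomes idle first" and compares it with a static round-robin assignment. Communication times are ignored, and the job durations in the usage example are invented for illustration.

```python
import heapq


def simulate_master_servant(job_durations, n_servants):
    """Dynamic dispatch: each job goes to the servant that is idle first.

    Returns (makespan, busy_time_per_servant).
    """
    # Heap entries: (time at which the servant becomes idle, servant id).
    idle_at = [(0.0, s) for s in range(n_servants)]
    heapq.heapify(idle_at)
    busy = [0.0] * n_servants
    for duration in job_durations:
        t_idle, servant = heapq.heappop(idle_at)
        busy[servant] += duration
        heapq.heappush(idle_at, (t_idle + duration, servant))
    makespan = max(t for t, _ in idle_at)
    return makespan, busy


def static_round_robin_makespan(job_durations, n_servants):
    """Static assignment: job i is fixed to servant i modulo n_servants."""
    return max(float(sum(job_durations[s::n_servants])) for s in range(n_servants))


jobs = [4, 1, 1, 1, 1, 2, 2]  # hypothetical job run times
dynamic, busy = simulate_master_servant(jobs, 2)
static = static_round_robin_makespan(jobs, 2)
print(f"dynamic: {dynamic}, static round robin: {static}, busy times: {busy}")
```

In this example the dynamic rule keeps both servants busy until the end, whereas the static assignment leaves one servant idle while the other still works through its queue.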


7 Conclusions 

The present model of optimal economic capital allocation entails an effective 
implementation of the threshold accepting heuristic, which enables an appropriate 
solution of the model’s underlying non-convex optimization problem. 
Implementing threshold accepting in the context of the model requires 
certain modifications that also affect the heuristic’s parameterization. Therefore, the 
parameterization’s requirements were examined in order to ensure the validity 
of the results of the model analyses. Besides the implementation of the heuristic itself, 
the model’s effectiveness stems from its ability to apply parallel 
computing comprehensively in order to obtain problem solutions. 

The analyses so far indicate the superiority of the optimal economic capital 
allocation over the alternative allocation methods within the framework 
of the model. These analyses take into account the difficulties of economic capital 
allocation, for example unstable correlations stemming from the decision makers’ 
autonomous decision making. 

Further analyses on the superiority of optimal economic capital allocation will 
address the case of an uninformed central management. During these investigations 
the central management uses Bayesian updating to learn about the characteristics 
of the decision makers and their trading activities. Another field of investigation 
will be the consideration of correlated decision making in context with herding and 
informational cascades and the corresponding consequences for optimal economic 
capital allocation. 
Distributed FE Analysis of Multiphase 
Composites Regarding 3D Elasticity Problems 


Kai Schrader and Carsten Könke 


Abstract Today the numerical simulation of damage effects in heterogeneous 
materials is done by adapting multiscale approaches. Consistent modeling in 
three dimensions with a high discretization resolution on each scale, based on a 
hierarchical or concurrent multiscale model, still poses difficulties. The algorithms have to 
be optimized with regard to computational efficiency and their distribution among 
the available hardware resources, which are often based on parallel architectures. In 
the last 5 years high performance computing (HPC) as well as GPU computation 
techniques have become established in scientific research. Consequently, in 
this work substructuring methods for partitioning FE-meshed specimens were 
implemented, tested and adapted to the HPC framework using several 
hundred CPU nodes. A memory-efficient iterative and parallelized equation solver 
combined with a special preconditioning technique for solving the underlying 
equation system was modified and adapted to combined CPU 
and GPU based computations. 


1 Introduction 

Modern discretization methods such as the finite element method 
(FEM, [1]) approximate partial differential equations which have to be solved 
numerically. Today in (material) engineering science, many investigations of material 
behaviour in 3D, e.g. the damage initiation and propagation at different scales, 
are based on complex and computationally expensive numerical simulations. Lifetime 
assessment of engineering structures relies on sophisticated material 
models that integrate all the different aspects of damage initiation and deterioration over 
the expected lifetime of a structure. Therefore, the current material models in 
engineering applications integrate modern approaches from material science 
via multiscale methods. Especially for heterogeneous materials, these multiscale 
approaches allow a detailed insight into the material physics on appropriate 
scales [2]. A major drawback of multiscale methods is the tremendous increase 
in degrees of freedom (d.o.f.s) in the resulting equation systems when studying 
models on the meso- or microscale [3,4]. In the framework of damage simulations, the 
incremental-iterative approach requires repeated solving of the linearized equation 
system [5], which increases the computing time even further. The 
key idea of the approach developed in this work is the application of an iterative 
distributed solver technique for the solution of the approximated Navier partial 
differential equations which arise from these computational elasticity or inelasticity 
multiscale problems. In recent years, distributed computing based on the 
message passing interface standard (MPI) [6] has proven itself. It enables the 
distributed solution of linear equation systems using as many computational 
nodes as are available in a high performance computing framework [7], given 
the widespread availability of enormous hardware resources. Furthermore, 
during the past years, hybrid CPU-GPU architectures were developed, using 
graphics processing units (GPU) which enable highly scalable implementations 
of linear algebra routines. With these, parallel code execution on different 
hardware architectures (CPU and GPU simultaneously) is possible [8]. 
Consequently, a memory-efficient iterative MPI solver strategy based on the 
conjugate gradient method (CG) was chosen and accelerated by an efficient 
preconditioning technique. The parallelization technique is based on a standard 
nonoverlapping domain decomposition method for FE problems combined with 
the elastic-inelastic domain split. Hence, the implementation takes different parallel 
hardware architectures into account, as well as the application in high performance 
computing centers. For that reason the developed algorithms for the linearized step 
have been evaluated on the hybrid CPU-GPU NEC Nehalem cluster [9] 
and the new petaflop system Cray XE6 at the High Performance Computing Center 
Stuttgart (HLRS). 


2 Continuum Mechanics 

The mechanical behaviour of a material continuum discretized into infinitesimal elements 
with bulk material properties can be described by fundamental formulations 
of continuum mechanics (Fig. 1). 

Fig. 1 Illustration of the motion of a material point from the reference to the 
momentary position 



2.1 Kinematics 


The motion of a single material point of the continuum from the 
reference position to the momentary position can be described by the 
equation 

$$\mathbf{x} = \boldsymbol{\varphi}(\mathbf{X}; t) = \mathbf{x}(\mathbf{X}) \tag{1}$$

The corresponding material deformation gradient is defined by 

$$\mathbf{F} = \frac{\partial \mathbf{x}}{\partial \mathbf{X}} \tag{2}$$

which transforms a material line element $d\mathbf{X}$ in the reference 
position into a material line element $d\mathbf{x}$ in the momentary position by 

$$d\mathbf{x} = \mathbf{F}\, d\mathbf{X} \tag{3}$$

in which F is in general an unsymmetric tensor. Material self-penetration is 
excluded by the condition 

$$J = \det(\mathbf{F}) > 0 \tag{4}$$

The determinant J of the deformation gradient F describes the volume ratio between 
reference and momentary configuration of a differential volume element deformed 
by F with 

$$dv = J\, dV \tag{5}$$
Additionally, F enables the derivation of different deformation measures. The 
right Cauchy-Green deformation tensor is given by 

$$\mathbf{C} = \mathbf{F}^{T}\mathbf{F} \tag{6}$$

and the Green-Lagrange strain tensor (with identity I) by 

$$\mathbf{E} = \tfrac{1}{2}(\mathbf{C} - \mathbf{I}) \tag{7}$$

The left Cauchy-Green deformation tensor (Finger tensor) is defined by 

$$\mathbf{b} = \mathbf{F}\mathbf{F}^{T} \tag{8}$$

With b the Hencky strain tensor can be formulated as 

$$\boldsymbol{\varepsilon} = \tfrac{1}{2}\ln(\mathbf{b}) \tag{9}$$

2.2 Stress Tensors 


The Cauchy stress tensor T induces an infinitesimal surface force related to the 
deformed momentary configuration of a surface element. Based on T several stress 
measures can be defined. The Kirchhoff stress tensor results in 

$$\boldsymbol{\tau} = J\,\mathbf{T} \tag{10}$$

The material stress tensor is defined by 

$$\mathbf{S} = \mathbf{F}^{-1}\mathbf{T}\mathbf{F}^{-T} \tag{11}$$

and consequently 

$$\mathbf{T} = \mathbf{F}\mathbf{S}\mathbf{F}^{T} \tag{12}$$

The first Piola-Kirchhoff stress tensor is given by 

$$\mathbf{T}_{P1} = J\,\mathbf{T}\mathbf{F}^{-T} \tag{13}$$

and the symmetric second Piola-Kirchhoff stress tensor by 

$$\mathbf{T}_{P2} = \mathbf{F}^{-1}\mathbf{T}_{P1} = J\,\mathbf{F}^{-1}\mathbf{T}\mathbf{F}^{-T} \tag{14}$$

Finally, the Mandel stress tensor results in 

$$\mathbf{M} = \mathbf{S}\mathbf{C} = \mathbf{F}^{-1}\mathbf{T}\mathbf{F} \tag{15}$$
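2.3 Constitutive Relations of Linear Elasticity 


The constitutive relation between the Green-Lagrange strain tensor E and the second 
Piola-Kirchhoff stress tensor is defined by 

$$\mathbf{T}_{P2} = \mathbb{C} : \mathbf{E} \tag{16}$$

also known as Hooke’s law for linear elastic materials, with $\mathbb{C}$ as the fourth-order 
elasticity tensor. The linearization of the Green-Lagrange strain tensor results in the 
linearized strain tensor ε 

$$\boldsymbol{\varepsilon} = \mathrm{lin}\,\mathbf{E} \tag{17}$$

The linearized stress tensor σ results in 

$$\boldsymbol{\sigma} = \mathbb{E}\,\boldsymbol{\varepsilon} \tag{18}$$

with $\mathbb{E}$ as the elastic material matrix. Considering stress and strain components, 
Eq. (18) can be written as 

$$\begin{pmatrix} \sigma_{xx} \\ \sigma_{yy} \\ \sigma_{zz} \\ \tau_{xy} \\ \tau_{xz} \\ \tau_{yz} \end{pmatrix}
= \mathbb{E}
\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \\ \gamma_{xy} \\ \gamma_{xz} \\ \gamma_{yz} \end{pmatrix} \tag{19}$$

For the isotropic case the elastic material matrix in terms of the material 
properties ν (Poisson’s ratio) and E (Young’s modulus) can be formulated as 

$$\mathbb{E} = \frac{E}{(1+\nu)(1-2\nu)}
\begin{pmatrix}
1-\nu & \nu & \nu & 0 & 0 & 0 \\
\nu & 1-\nu & \nu & 0 & 0 & 0 \\
\nu & \nu & 1-\nu & 0 & 0 & 0 \\
0 & 0 & 0 & \frac{1-2\nu}{2} & 0 & 0 \\
0 & 0 & 0 & 0 & \frac{1-2\nu}{2} & 0 \\
0 & 0 & 0 & 0 & 0 & \frac{1-2\nu}{2}
\end{pmatrix} \tag{20}$$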
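
As an illustration of Eq. (20), the following Python sketch assembles the isotropic elastic material matrix for the component ordering (xx, yy, zz, xy, xz, yz) with engineering shear strains. The function name is illustrative and the steel-like values in the usage example (Young’s modulus in MPa, Poisson’s ratio) are assumptions, not data from this work.

```python
def isotropic_elasticity_matrix(young, poisson):
    """6x6 isotropic elastic material matrix, ordering (xx, yy, zz, xy, xz, yz)."""
    if not -1.0 < poisson < 0.5:
        raise ValueError("Poisson's ratio must lie in (-1, 0.5)")
    factor = young / ((1.0 + poisson) * (1.0 - 2.0 * poisson))
    matrix = [[0.0] * 6 for _ in range(6)]
    # Normal components: 1 - nu on the diagonal, nu off the diagonal.
    for i in range(3):
        for j in range(3):
            matrix[i][j] = factor * ((1.0 - poisson) if i == j else poisson)
    # Shear components: (1 - 2 nu) / 2, which equals the shear modulus
    # E / (2 (1 + nu)) after multiplication with the prefactor.
    for i in range(3, 6):
        matrix[i][i] = factor * (1.0 - 2.0 * poisson) / 2.0
    return matrix


# Hypothetical steel-like parameters.
young, poisson = 210000.0, 0.3
D = isotropic_elasticity_matrix(young, poisson)

# Stress response to a pure shear strain gamma_xy = 0.001.
strain = [0.0, 0.0, 0.0, 0.001, 0.0, 0.0]
stress = [sum(D[i][j] * strain[j] for j in range(6)) for i in range(6)]
print("tau_xy =", stress[3])
```

The shear entry reproduces the shear modulus E / (2(1 + ν)), which is a quick consistency check for the reconstructed matrix.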


2.4 Extension to Material Inelasticity 

According to the classical plasticity formulation [10] the linearized strain tensor is 
split into an elastic and an inelastic part with 

$$\boldsymbol{\varepsilon} = \boldsymbol{\varepsilon}^{tot} = \boldsymbol{\varepsilon}^{el} + \boldsymbol{\varepsilon}^{pl} \tag{21}$$

The linear elastic constitutive relation according to Hooke’s law, Eq. (18), with the 
elastic material matrix $\mathbb{E}$ is modified to 

$$\boldsymbol{\sigma} = \mathbb{E}\,(\boldsymbol{\varepsilon}^{tot} - \boldsymbol{\varepsilon}^{pl}) \tag{22}$$

The stress state (elastic or inelastic) is evaluated by means of 
a yield condition $f(\boldsymbol{\sigma})$ which depends on the current stress state σ and 
the material-specific yield stress $\sigma_y$ as a scalar value. In general, f (without 
consideration of hardening effects) can be expressed as 

$$f(\boldsymbol{\sigma}) = |\boldsymbol{\sigma}| - \sigma_y \;
\begin{cases} < 0 & \text{elastic} \\ = 0 & \text{inelastic} \end{cases} \tag{23}$$


Since the evaluation of $f(\boldsymbol{\sigma})$ for multiaxial stress states of inelastic material 
behaviour does not result exactly in zero, mapping techniques such as return 
mapping methods were developed for the computation of the inelastic stress and 
strain components. The equivalent von Mises stress is defined as 

$$\sigma_v = \sqrt{3 I_2} \tag{24}$$

where $I_2$ is the second invariant of the deviatoric part $\boldsymbol{\sigma}^{dev}$ of the stress 
tensor σ (with components $s_{ij}$), which is given by 

$$I_2 = \tfrac{1}{2} s_{ij} s_{ij} = \tau_{xy}^2 + \tau_{yz}^2 + \tau_{zx}^2 - (s_{xx}s_{yy} + s_{yy}s_{zz} + s_{zz}s_{xx}) \tag{25}$$

The von Mises yield condition for isotropic material results in 

$$f(\boldsymbol{\sigma}) = |\boldsymbol{\sigma}| - \sigma_y = \sqrt{3 I_2} - \sigma_y \tag{26}$$

in which $\sigma_y$ describes, e.g., the tensile strength as a material property used in case 
of uniaxial tensile loading conditions. This introduction to the basic constructs of 
continuum mechanics is complemented by some fundamental formulations of 
the finite element method (FEM) given in the following section. 
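
The yield check of Eqs. (23) to (26) can be sketched in a few lines of Python. The sketch assumes that the invariant in Eq. (25) is evaluated with the components of the stress deviator, and it does not include a return mapping step; the function names and the stress values in the usage example are illustrative.

```python
import math


def von_mises_stress(sxx, syy, szz, txy, tyz, tzx):
    """Equivalent von Mises stress from the six stress components."""
    mean = (sxx + syy + szz) / 3.0
    # Normal components of the stress deviator; shear components are unchanged.
    dxx, dyy, dzz = sxx - mean, syy - mean, szz - mean
    invariant = (txy ** 2 + tyz ** 2 + tzx ** 2
                 - (dxx * dyy + dyy * dzz + dzz * dxx))
    # Guard against tiny negative values caused by round-off.
    return math.sqrt(3.0 * max(invariant, 0.0))


def yield_function(stress, yield_stress):
    """f < 0: elastic state, f >= 0: inelastic state (no hardening)."""
    return von_mises_stress(*stress) - yield_stress


uniaxial = (200.0, 0.0, 0.0, 0.0, 0.0, 0.0)    # hypothetical values in MPa
pure_shear = (0.0, 0.0, 0.0, 100.0, 0.0, 0.0)
print(von_mises_stress(*uniaxial))
print(von_mises_stress(*pure_shear))
print(yield_function(uniaxial, 250.0))
```

For uniaxial tension the equivalent stress equals the applied stress, which is why the yield stress can be identified with the tensile strength in that loading case.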


3 Finite Element Method 

The finite element method [1] is an approximate solution technique for 
partial differential equations resulting from the discretization of different physical 
problems. Starting with a 3D elasticity problem Ψ which is bounded by the domain Ω, 
this problem can be described by the equilibrium equation 

$$\sigma_{ij,j} + b_i = 0 \tag{27}$$

with 

$$\sigma_{ij} = C_{ijkl}\,\varepsilon_{kl}, \qquad \varepsilon_{kl} = \tfrac{1}{2}(u_{k,l} + u_{l,k}) = \nabla_s(\mathbf{u})_{kl} \tag{28}$$

The boundary is divided according to 

$$\partial\Omega = \Omega_t \cup \Omega_u, \qquad \Omega_t \cap \Omega_u = \emptyset \tag{29}$$

and the boundary conditions are 

$$\sigma_{ij} n_j = t_i \ \text{on } \Omega_t, \qquad u_i = \bar{u}_i \ \text{on } \Omega_u \tag{30}$$

Here $\sigma_{ij}$ are the components of the stress tensor, $b_i$ the components of the body force 
and $n_j$ the unit outward normal. $u_i$ are the displacements and $C_{ijkl}$ is the material 
tensor. The weak form of Eq. (27) with zero body force is given as 



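$$\int_{\Omega} \nabla_s(\delta\mathbf{u})_{ij}\, C_{ijkl}\, \nabla_s(\mathbf{u})_{kl}\, d\Omega - \int_{\Omega_t} \delta u_i\, t_i\, d\Gamma = 0 \tag{31}$$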


4 Preconditioning Techniques for the Conjugate 
Gradient Method 

4.1 Preface 

The conjugate gradient method (CG) can be used as an iterative and distributed 
solution procedure [11] to solve the symmetric, regular and positive definite systems 
of equations [12] which arise from the FE approximation of the discretized 
problem. The preconditioning technique for CG [13] is crucial with regard to the memory 
demand and computing time of the preconditioner as well as the computing time 
of the CG iteration procedure itself when solving a linear system of equations such as 

$$\mathbf{K}\mathbf{u} = \mathbf{f} \tag{32}$$


where K is the global stiffness matrix, u the nodal vector of the displacements and 
f the nodal vector of external structural forces. In this work a parallelized version 
of the preconditioned conjugate gradient method (PPCG) was implemented based 
on nonoverlapping domain decomposition [14,15] without explicitly building the 
Schur complement system [16]. Secondly, the preconditioning technique is 
restricted to a scaled main-diagonal preconditioner with a special scaling 
parameter based on the upper or lower bound of the spectral radius of the assembled 
submatrices. This reduces the time for building the preconditioning matrix as well 
as the memory demand and, additionally, the time for the execution of matrix-vector 
products involving the preconditioning matrix. Finally, the sparse matrix-vector 
product performed for each subdomain takes different matrix 
storage formats into account. Owing to the nodal storage scheme of the distributed FE data, a 
nodal compressed row storage can be used to improve the performance compared to 
standard coordinate storage (coo) or compressed row storage (csr or crs). 
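
The difference between the coordinate and the compressed row storage can be made concrete with a small example. The Python sketch below implements the sparse matrix-vector product for scalar coo and csr storage and converts one into the other. It is an illustration only: the nodal variant used in this work stores 3x3 blocks per node pair, which is not reproduced here, and all function names are assumptions.

```python
def spmv_coo(rows, cols, vals, x, n_rows):
    """y = A x with A stored as (row, column, value) triplets."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


def coo_to_csr(rows, cols, vals, n_rows):
    """Convert triplets (in any order) to row pointer, column index, data."""
    row_ptr = [0] * (n_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1              # count entries per row
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]     # prefix sum gives the row offsets
    next_slot = row_ptr[:-1]             # copy: next free position per row
    col_idx = [0] * len(vals)
    data = [0.0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        col_idx[k], data[k] = c, v
        next_slot[r] += 1
    return row_ptr, col_idx, data


def spmv_csr(row_ptr, col_idx, data, x):
    """y = A x with A in compressed row storage."""
    return [sum(data[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]


# Small symmetric tridiagonal test matrix, triplets deliberately unsorted.
rows = [2, 0, 1, 1, 0, 2, 1]
cols = [2, 0, 0, 1, 1, 1, 2]
vals = [4.0, 4.0, -1.0, 4.0, -1.0, -1.0, -1.0]
x = [1.0, 2.0, 3.0]

y_coo = spmv_coo(rows, cols, vals, x, 3)
row_ptr, col_idx, data = coo_to_csr(rows, cols, vals, 3)
y_csr = spmv_csr(row_ptr, col_idx, data, x)
print(y_coo, y_csr, row_ptr)
```

The csr layout replaces one row index per entry by one offset per row, which reduces the index storage and lets each row be processed as a contiguous range.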


4.2 Modified and Parallelized Jacobi Preconditioning 

The simplest and most memory-efficient preconditioning technique, which only 
involves the inverted main diagonal of the matrix to be 
preconditioned, is known as the Jacobi point preconditioner. The parallelization 
based on the domain decomposition method simultaneously considers the 
main diagonal of each domain matrix. The modified preconditioning technique to 
obtain $\mathbf{M}^{*}$ per domain j results from the scaled main diagonal or the scaled block 
diagonal of the global domain matrices 

$$\begin{pmatrix} \mathbf{M}_{ii}^{j} & 0 \\ 0 & \mathbf{M}_{bb}^{j} \end{pmatrix}
= \begin{pmatrix} \mathrm{diag}[\mathbf{K}_{ii}^{j}] & 0 \\ 0 & \mathrm{diag}[\mathbf{K}_{bb}^{j}] \end{pmatrix} \tag{33}$$

The necessary assembly (MPI_Allreduce) regarding the connecting boundary b 
results in 

$$\mathbf{M}_{bb} = \sum_{j=1}^{nd} \mathbf{M}_{bb}^{j} \tag{34}$$

Inversion and modified scaling yield 

$$(\mathbf{M}^{*})^{-1} = \omega \begin{pmatrix} \mathbf{M}_{ii}^{j} & 0 \\ 0 & \mathbf{M}_{bb} \end{pmatrix}^{-1}
= \omega \begin{pmatrix} (\mathbf{M}_{ii}^{j})^{-1} & 0 \\ 0 & \mathbf{M}_{bb}^{-1} \end{pmatrix} \tag{35}$$

where the scaling parameter ω is approximated by using an efficient eigenvalue 
strategy for the main diagonal blocks of the stiffness matrices of all subdomains. 
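
To show where the inverted and scaled main diagonal enters the iteration, the following Python sketch implements a sequential preconditioned conjugate gradient solver with a scaled Jacobi preconditioner. It is a single-process illustration under assumptions: the matrix is dense and tiny, no domain decomposition or MPI_Allreduce assembly is performed, and the value of `omega` is chosen arbitrarily rather than derived from an eigenvalue estimate.

```python
def pcg_scaled_jacobi(K, f, omega=1.0, tol=1e-10, max_iter=1000):
    """Solve K u = f for a dense SPD matrix K given as nested lists.

    The preconditioner is the inverted main diagonal of K multiplied by omega.
    Returns (u, number_of_iterations).
    """
    n = len(f)

    def matvec(v):
        return [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    inv_m = [omega / K[i][i] for i in range(n)]
    u = [0.0] * n
    r = list(f)                          # residual f - K u for u = 0
    f_norm = dot(f, f) ** 0.5
    if f_norm == 0.0:
        return u, 0
    z = [inv_m[i] * r[i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)
    for iteration in range(1, max_iter + 1):
        Kp = matvec(p)
        alpha = rz / dot(p, Kp)
        u = [u[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Kp[i] for i in range(n)]
        if dot(r, r) ** 0.5 <= tol * f_norm:
            return u, iteration
        z = [inv_m[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return u, max_iter


K = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
f = [2.0, 4.0, 10.0]
u, iterations = pcg_scaled_jacobi(K, f, omega=0.9)
print(u, iterations)
```

Note that in this simplified setting a single global factor ω leaves the CG iterates unchanged in exact arithmetic, so the sketch only indicates where such a factor enters; in a distributed code the matrix-vector product and the two dot products are the places where communication between subdomains occurs.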


5 Numerical Results 

5.1 NEC Nehalem Cluster: Voxel-Based Microstructural 
Specimen 

The numerical analysis is based on the implemented parallel algorithms applied to a 
large-scale 3D microstructural bone specimen with several million degrees of freedom, 
which is illustrated in Fig. 2. Table 1 contains the information about 
Fig. 2 3D poriferous bone specimen as FE structure based on voxel data: perspective view (left), 
load-balanced decomposed FE structure in four nodally equal-sized domains (right), 8.9 million d.o.f.s 


Table 1 Dimensions of the decomposed FE problem: number of coupled 
nodes (ncn), number of nodes (nn), number of non-diagonal nodal FE blocks 
(nnb) and memory demand for matrix storage in gigabytes (mmem) by domain j 

j    ncn      nn         nnb          mmem 
1    5,108    488,783    5,418,438    0.547 
2    5,896    487,927    5,548,558    0.546 
3    4,399    489,546    5,545,580    0.554 
4    3,198    488,472    5,570,790    0.550 


the dimension of the decomposed problem such as number of elements, number of 
nodes, number of resulting d.o.f.s, number of nodal FE blocks and memory demand. 
The voxel model is converted to a regular grid based on hexahedral elements with 
linear shape functions. The nodal partitioning [17] of the hexahedral mesh, e.g. into 
four subdomains, is illustrated in Fig. 2 (right). The numerical tests, including the 
building and assembling of the global stiffness matrices as well as the solution 
of the equation system in parallel with the preconditioned conjugate gradient 
method, were scaled up to 64 MPI processes (equal to the number of subdomains). 


5.1.1 Benchmarking: Multiple CPU-Nodes 

Figure 3 presents the scaling of the computational time for the assembly 
of the global stiffness matrix with an increasing number of subdomains, including the 
numerical integration of the finite elements. Figure 4 contains the total times needed 
for solving the equation system with the parallelized PCG method. Figure 5 illustrates 
the scaling of the computational time considering only the computation of the 
sparse matrix-vector product during the PCG iterations. Figure 6 shows 
the scaling of the iterative solver excluding the times for the sparse matrix-vector 
Fig. 3 NEC Nehalem cluster (CPU): Total computational time for the parallel assembling 
of global stiffness matrices (including the numerical integration) with increasing number of 
subdomains, 8.9 million d.o.f.s 

Fig. 4 NEC Nehalem cluster (CPU): Total computational time based on the parallelized 
preconditioned conjugate gradient method, 8.9 million d.o.f.s 

Fig. 5 NEC Nehalem cluster (CPU): Accumulated time for sparse matrix-vector operations of the 
PPCG method, 8.9 million d.o.f.s 

product. Finally, Fig. 7 presents the computational times caused by the MPI 
communication overhead between different MPI processes, which is mainly induced 
by the MPI_Allreduce operation during the PPCG iteration. These times increase 
moderately as more subdomains are used. 
Fig. 6 NEC Nehalem cluster (CPU): Scaling of accumulated computational times for 
non-matrix-vector operations of the PPCG method, 8.9 million d.o.f.s 

Fig. 7 NEC Nehalem cluster (CPU): Scaling of accumulated computational times for the MPI 
based communication of the PPCG method, 8.9 million d.o.f.s 


5.1.2 Benchmarking: Hybrid Multiple CPU-GPU Nodes 

Additionally, we have implemented a hybrid parallelization technique for the PCG 
method using the CPU and GPU in a combined fashion, suitable for up to 16 T 
nodes of the NEC Nehalem cluster. The graphical Tesla subsystem is based upon 
the Nvidia Tesla S1070 GPU. Each MPI process has access to 1 T unit and executes 
the sparse matrix-vector product of the PCG method on it. The coo storage 
format is used for the MPI-distributed stiffness matrices of each subdomain and their 
allocation on the GPU. The subdomain matrix is then distributed 
over the maximum number of available GPU threads per GPU, which enables a 
simultaneous sparse matrix-vector multiplication per matrix block and GPU thread. 
The results of the hybrid model are compared to the computational times for the 
spmv operation on the CPU nodes (Fig. 5), as illustrated in Fig. 8 for different 
storage formats and in Fig. 9 for different data transfer techniques from and to 
the host memory. The computational times of the hybrid model include the 
times for the memory transfer from CPU host to GPU device and from GPU 
Fig. 8 NEC Nehalem cluster (CPU-GPU): Accumulated computational times for sparse 
matrix-vector operations of the PPCG method using the CPU-only and hybrid CPU-GPU cluster, 
8.9 million d.o.f.s 

Fig. 9 NEC Nehalem cluster (CPU-GPU): Accumulated computational times for sparse 
matrix-vector operations of the PPCG method using CPU-only and hybrid CPU-GPU cluster with 
synchronous (hybrid-sync), asynchronous (hybrid-async) and mapped memory (hybrid-mapped) 
CPU-GPU data transfer for the coo matrix storage format 


device to CPU host, respectively, before and after the spmv computation. Hence, 
synchronous (hybrid-sync), asynchronous (hybrid-async) and mapped memory 
(hybrid-mapped) data transfer are compared with the computational times of 
the CPU-only spmv computations for the coo matrix storage format, as 
shown in Fig. 9. 


5.2 Cray XE6 Cluster: 3D Large-Scale Nickel-Alloy Specimen 

The second high performance computing framework is the Cray XE6 
cluster ‘Hermit’ at HLRS Stuttgart, which has been in production since the beginning 
of 2012. With the computational power of 3,552 computing nodes, where 
each node consists of one AMD Opteron Interlagos, it is possible to scale parallelized 
code up to several ten thousand cores with a total peak performance of nearly one 
Fig. 10 Multiphase Ni-alloy specimen with pore structure, voxel model based on 
computer-tomographic scans 

Fig. 11 Cray XE6 cluster: Total computational time for the parallel assembling of global stiffness 
matrices (including the numerical integration) with increasing number of subdomains, 42.8 million 
d.o.f.s 


petaflop. The AMD Interlagos processor, basically a 32-core x86-64 architecture, 
was introduced in 2011. The second example is based on a large-scale voxel 
discretization of a nickel-alloy specimen (Fig. 10). The size of this 
large-scale FE specimen is given by 

• 14,093,177 hexahedral elements 

• 14,292,274 FE nodes 

• 42,876,822 nodal unknowns or d.o.f.s 

The scaling starts with one single node and is increased to 256 physical CPU 
nodes. Table 2 shows the quantitative values for storing the FE data, the 
memory demand and the sequential computing times which are necessary 
for the matrix assembly. Figures 11-14 illustrate the scaling behaviour with respect 
to FE assembly, PCG computation, matrix-vector computation, non-matrix-vector 
computation and MPI communication effort. Furthermore, a 4 byte data transfer 
for MPI communication was implemented, which moderately improves the scaling 
behaviour, especially if more than 64 subdomains are used. 
Fig. 12 Cray XE6 cluster: Total computational time based on the parallelized preconditioned 
conjugate gradient method, 42.8 million d.o.f.s 

Fig. 13 Cray XE6 cluster: Accumulated time for sparse matrix-vector operations of the PPCG 
method, 42.8 million d.o.f.s 

Fig. 14 Cray XE6 cluster: Scaling of accumulated computational times for the MPI based 
communication of the PPCG method, 42.8 million d.o.f.s 


Table 2 Cray XE6 cluster: Quantitative values of total (3x3) FE blocks, the total number of block 
entries, memory demand and assembly time for the sequential case, 42.8 million d.o.f.s 

Storage   FE blocks total   Block entries total   Memory (GB)   Seq. assembly (s) 
COO       184,005,692       1,061,464,238         17.095        3,079 
ndcsr     184,005,692       1,061,464,238         12.977        3,001 
6 Outlook 

In ongoing research the computing model will be extended to consider 
damage effects in multiphase composites. The resulting iterative analysis will be 
realized in a nonlinear simulation framework for high performance computers. 
Reduction of Numerical Sensitivities in Crash 
Simulations on HPC-Computers (HPC-10) 


Christiana Eck, Oliver Mangold, Raphael Prohl, and Anton Tkachuk 


Abstract For practical application in engineering, numerical simulations are 
required to be reliable and reproducible. Unfortunately, crash simulations are highly 
complex and nonlinear, and small changes in the initial state can produce big 
changes in the results. This is caused partially by physical instabilities and partially 
by numerical instabilities. The aim of the project is to identify the numerical sensitivities 
in crash simulations and to suggest methods to reduce the scatter of the results. Work 
has been undertaken to identify sources of sensitivities through parameter studies and 
to improve existing mathematical formulations, e.g. of contact-impact and material 
models. Furthermore, a tool is being developed for the generation of code from mathematical 
descriptions of finite elements with the aim of reducing the effort required to 
modify implementations of FEM models. 
1 Identification of Sources of Sensitivity Through 
Parameter Studies 

In simulations one often obtains different results for identical input values. The aim 
of this research is to distinguish whether this scatter is of physical or numerical nature 
and to eliminate or at least reduce the numerical scatter. For this purpose 
parametric studies were carried out in the past with the commercial software 
packages Radioss and LS-Dyna. 

Investigations with these general-purpose-programs have been expanded. In 
addition, further studies in Pamcrash have been conducted. As in the previous 
investigations the following parameters have been examined in Pamcrash: time step, 
mass-scaling, strain rate, rate filter, damping, contact-stiffness, stiffness, contact, 
element refinement, element formulation. 

One finds that in all general-purpose programs the scatter behavior is influenced 
by the same factors. In particular, these include damping, contact stiffness, strain rate 
and rate filter. 

Because many parameters are predetermined in Pamcrash, the 
approach of the examination has been changed. Moreover, there is a given set 
of parameters which is based on experience. All calculations have been executed 
starting with this set of parameters. Following this practice, the influence of the time 
step reduction is negligible. Moreover, the scatter is smaller; therefore one needs a measure 
to compare the results. 

To obtain a measure of the scatter, for each function $f_m$, $m \in M$, $M = 
\{1, \dots, n\}$, $n \in \mathbb{N}$, with $f_m : T \times F \to \mathbb{R}$, for T time and F force, in the time-force 
diagram the formula 

(1) 

has to be computed and subsequently the Euclidean norm has to be determined 

(2) 

These results have to be weighted in a suitable way and accordingly a median has to 
be calculated. 

The sensitivity analysis has been carried out on a very basic model so far. The 
model consists of a tube and a plate. For the sensitivity analysis several calculations 
with slightly varied initial velocities have been carried out for various parameter 
settings. 

Figure 1 shows a force-time diagram. It illustrates the impact of strain rate 
behavior. The left diagram displays a computation with strain rate dependency and 
the right diagram one without strain rate dependency. Both calculations have been 
conducted without rate filter. 
Fig. 1 Force-time diagram. Computations with strain rate dependency (left) and without strain 
rate dependency (right), in both cases without rate filter 

Fig. 2 Basic model computed repeatedly with the same parameters and the same initial velocities. 
The only difference between the computations is the distance between the plate and the tube 


Moreover, it was investigated whether the scatter is physical or numerical. One 
way to find out is to compute exactly the same problem repeatedly. Numerical 
scatter would appear in each calculation, physical scatter probably not. Figure 2 
shows results of several calculations performed with the same parameters and the 
same initial velocities. The only difference is the distance between the plate and the 
tube (the distance increases in each further calculation). This ensures that the same 
problem is computed, merely shifted in time. The result implies that the scatter has to 
be physical. 

The next step will be to extend the findings of the simple model to whole- and 
part-vehicle models, respectively. Figure 3 shows first results of a part-vehicle model 
computed with under- and fully integrated elements. 
Fig. 3 Part-vehicle model computed with under- (above) and fully integrated elements 



2 Improving Robustness of Contact Discretization Using 
Hybrid Singular Mass Matrices 

Nowadays the majority of crash simulations are done using explicit finite element 
codes. In the presence of contact-impact, buckling and non-elastic material response, 
explicit codes rarely fail to complete a computation. In contrast, implicit codes, which 
rely on predictor-corrector methods, face convergence problems in the corrector step 
and may fail. However, further refinement of meshes in crash simulations requires very 
small time steps for explicit codes. As an alternative, more robust implicit space-time 
discretization schemes must be developed. 

A novel spatial discretization of elastodynamic contact is proposed. It can be 
used both for bulk and thin-walled structures. The main idea of this discretization 
is to split dynamics and contact such that contact is collocated at a set of 
massless nodes, whereas all inertia properties are condensed at the other nodes. This 
substantially improves the conditioning of the problem, reducing the differential index 
from 3 to 1 [2]. Using such a mass redistribution improves the robustness of time 
integration. It also leads to a substantial reduction of oscillations in the contact 
pressures [1,3]. 

Fig. 4 3-node line element with contact collocation at the massless nodes 

The variational framework that allows such mass redistribution relies on a 3-field 
Hamilton's principle with independent variables for displacement, velocity and 
momentum [5, Appendix I]. Their spatial discretization together with a local 
elimination procedure delivers hybrid-mixed mass matrices. If the discretization spaces 
are specially built, the mass matrix entries for certain nodes vanish. Such mass 
matrices are called Hybrid Singular Mass Matrices and were first proposed for 
triangular elements in a paper by Renard [4]. 

For example, a 3-node line element uses the corner nodes for contact conditions, and 
the middle, inner node carries the entire mass of the element, see Fig. 4. A simple 
benchmark setup is given in Fig. 5a: a 2D Timoshenko beam hinged at both ends 
bounces off a rigid obstacle. The quality of the contact forces improves, 
see Fig. 5b. 
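
The mass split for the 3-node line element can be illustrated with a small Python sketch. It is only an illustration under simplifying assumptions: a straight element with constant density, the node order (corner, middle, corner), and the whole element mass simply placed at the middle node. The names `singular_lumped_mass`, `rho`, `area` and `length` are hypothetical, and the hybrid-mixed derivation of [4, 5] is not reproduced.

```python
import numpy as np

def singular_lumped_mass(rho, area, length, dofs_per_node=1):
    """Diagonal mass matrix of a 3-node line element (corner, middle, corner).

    Illustrative assumption: the whole element mass sits at the middle node;
    the two corner nodes are massless and are reserved for contact.
    """
    element_mass = rho * area * length
    nodal_mass = np.array([0.0, element_mass, 0.0])
    return np.diag(np.repeat(nodal_mass, dofs_per_node))

# Example values are arbitrary (steel-like density, small cross-section).
M = singular_lumped_mass(rho=7850.0, area=1.0e-4, length=0.1)
massless = np.flatnonzero(np.diag(M) == 0.0)  # nodes available for contact collocation
```

The matrix is singular by construction: the total mass is preserved, but the rows belonging to the contact nodes are zero, so the contact conditions at these nodes become purely algebraic constraints.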

Such a split can also be found for 9-node quadrilateral and 27-node hexahedron 
elements if a tensor product formula is used for the shape functions, see Fig. 6. These 
elements are under testing. 

This method can be proposed for general contact-impact problems. Deep drawing 
and crash of individual components may be seen as benchmarks for studying the 
compatibility of the approach with explicit codes. This will be covered in future 
work. 


3 Development of Robust Algorithms in Finite 
Elastoplasticity 

The simulations of crash tests are very sensitive numerical procedures, which 
require very accurate and robust algorithms in order to handle numerical instabilities. 
In addition, physical instabilities, which may be due, for example, 
to the bifurcation behavior of the material under asymmetric loading, have to be 
detected and correctly treated by the numerical procedures in order to obtain reliable 
results. In this context, our task is to identify and eliminate numerical instabilities 
and improve the mathematical analysis of the physical ones. 

Fig. 5 Bounce benchmark for a Timoshenko beam. (a) Benchmark setup and (b) contact force at 
middle node computed with standard mass matrix (left) and singular mass matrix (right) 

Fig. 6 9-node quadrilateral element (the legend distinguishes nodes with mass from massless nodes) 

For our purposes, we first formulated simplified versions of some full-vehicle 
models used by the automotive industry. By “simplified versions” we mean that our 
models are purely mechanical, take into account only a selected number of physical 
effects and are implemented for relatively simple (unrealistic) geometries. These 
preliminary steps are necessary for providing computationally cheap benchmark 
problems to be used for testing numerical codes and improving their functionality 
when needed. With this in mind, we focused on the stability of some algorithms 
used in crash test simulations. In particular, we closely investigated the 
return-mapping algorithm [7]. By reviewing this computational procedure in the context 
of nonlinear programming [9], we developed a new class of solution methods with 
better mathematical properties. 

Our next step is to adopt the methods presented in [6] in order to include 
large-deformation contact in our problem. After the stability of our numerical algorithms 
has been proven, the goal is to formulate a reliable criterion for identifying physical 
instabilities within a crash simulation. 


3.1 Reduced Model for Finite Elastoplasticity 

In our mathematical model of crash tests, we consider large elastic and plastic strains 
as well as nonlinear material behavior. Initially we concentrate on geometric and 
kinematic nonlinearities as well as on nonlinear material properties and elastoplastic 
material behavior. 


3.1.1 Balance Laws 

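We consider the balance of linear momentum in local, material form: 

\[
  -\operatorname{DIV}(\mathbf{F}\mathbf{S}) = \rho_0 \mathbf{B} \quad \text{in } \Omega, \qquad
  \mathbf{u} = \mathbf{d} \quad \text{on } \Gamma_d \qquad (3)
\]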

In (3), F denotes the deformation gradient, S is the second Piola-Kirchhoff stress 
tensor, ρ_0 is the reference density, B is the body force per unit mass, and d represents 
the prescribed displacement on the Dirichlet boundary of the computational domain. 
For the first part of our study, inertial terms are not included in the force ρ_0 B. 
Dropping inertial terms in a mathematical model that should be the 
basis for crash test simulations may sound like a strong contradiction. However, 
our task here is to analyze and compare algorithms that should be applied in the 
numerical simulations of crash tests. In this respect, inertial terms are, at this stage, 
only “temporarily” switched off. This saves computational resources when different 
algorithms are compared, and allows us to focus on possible numerical 
instabilities that, hidden behind a given algorithm, may exist independently of the 
consideration of inertial terms. 

We restrict our investigations to a purely mechanical framework. Consequently, 
thermal phenomena are excluded from the outset and dissipation is expressed in 
terms of mechanical quantities only. 


3.1.2 Constitutive Equations, Associative Flow Rule 
and Karush-Kuhn-Tucker Conditions 

We consider the formulation of the J2 flow theory at finite strains for hyperelastic- 
plastic materials [7]. The constitutive equations, the flow rule and the KKT- 
conditions read: 



(4) 



(5) 


\[ \gamma \ge 0, \quad f(\boldsymbol{\tau}) \le 0, \quad \gamma\, f(\boldsymbol{\tau}) = 0 \qquad (6) \]


In (4)-(6), W is a stored energy function, C is the Cauchy-Green strain tensor, 
b^e is the elastic part of the Finger tensor, τ is the Kirchhoff stress tensor, f is the 
von Mises flow condition and γ is the (Karush-Kuhn-Tucker) plastic multiplier. 


3.2 Numerical Methods 

The numerical computations of the quasi-static case are performed by an incremental 
procedure, which contains a nonlinear sub-problem in every single incremental step. 

3.2.1 Discretization 

We obtain the discrete material model by an implicit Euler method in time for 
the evolution of the internal parameters, e.g. plastic strains or hardening variables. 
Inserting this update in the constitutive equation (4), we get a nonlinear incremental 
stress response, which has to satisfy the balance of momentum (3). Thus, we 
have to solve a nonlinear variational problem for the displacements. In order to 
discretize our equations in space we use a standard finite-element-method with 
trilinear hexahedron elements for the displacements. 

3.2.2 Plasticity Algorithm 

To start with the treatment of the governing equations of plasticity, we adopt the 
return-mapping algorithm (RMA) [7]. As remarked in [9], this well-established 
procedure, which requires low computational effort, may become unstable because 
it computes stresses that do not necessarily satisfy the global equilibrium equations. 
This drawback can be mitigated by resorting to an algorithm based on 
the linearization of (4)-(6). In [9], it is shown that such an approach leads to a 
computational method with higher robustness in the case of small strains. The 
additional iteration, introduced by the linearization of (4)-(6), defines an 
algorithm which iterates along stresses satisfying equilibrium until (4)-(6) are 
“sufficiently” fulfilled. We extended this method to the case of finite deformations. 
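
To make the structure of a return-mapping step concrete, the following Python sketch shows the classical radial return for J2 plasticity. It deliberately uses a much simpler setting than the one studied here: small strains, perfect plasticity (no hardening) and a linear elastic law with shear modulus `mu` and bulk modulus `kappa`. All function names and numerical values are illustrative assumptions; this is neither the finite-strain formulation nor the linearization-based variant of [9].

```python
import numpy as np

def deviator(t):
    """Deviatoric part of a symmetric 3x3 tensor."""
    return t - np.trace(t) / 3.0 * np.eye(3)

def von_mises_yield(sigma, sigma_y):
    """Yield function f(sigma) = ||dev sigma|| - sqrt(2/3) * sigma_y."""
    return np.linalg.norm(deviator(sigma)) - np.sqrt(2.0 / 3.0) * sigma_y

def radial_return(eps, eps_p_old, mu, kappa, sigma_y):
    """One implicit Euler step; returns stress, plastic strain and multiplier increment."""
    eps_e_trial = eps - eps_p_old
    # Elastic predictor: assume the whole increment is elastic.
    sigma_trial = kappa * np.trace(eps_e_trial) * np.eye(3) + 2.0 * mu * deviator(eps_e_trial)
    f_trial = von_mises_yield(sigma_trial, sigma_y)
    if f_trial <= 0.0:
        return sigma_trial, eps_p_old, 0.0          # elastic step, multiplier is zero
    # Plastic corrector: project the trial stress back onto the yield surface.
    s_trial = deviator(sigma_trial)
    n = s_trial / np.linalg.norm(s_trial)           # flow direction (associative rule)
    dgamma = f_trial / (2.0 * mu)
    return sigma_trial - 2.0 * mu * dgamma * n, eps_p_old + dgamma * n, dgamma

# Simple shear example with arbitrary, steel-like values (units: MPa).
eps = np.zeros((3, 3))
eps[0, 1] = eps[1, 0] = 0.01
sigma, eps_p, dgamma = radial_return(eps, np.zeros((3, 3)), 80.0e3, 160.0e3, 250.0)
```

After the corrector, the discrete counterparts of the conditions (6) hold: the multiplier increment is non-negative, the yield function is non-positive, and their product vanishes. Note that the update is purely local, which is why the resulting stresses need not satisfy global equilibrium.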


3.2.3 Solution Methods 

In our code the nonlinear variational problem is solved by a Newton method. The 
consistent tangent operator, introduced by Simo and Taylor in 1985, provides the 
basis for the linearization therein. We use a parallel multigrid solver for solving 
the linear sub-problems featuring in the Newton scheme [10]. 


3.3 Numerical Tests 

The implementation of the RMA in our software framework UG (“Unstructured 
Grids” [8]) has been tested on several benchmark problems. 


3.3.1 Benchmark Problems 

As a reference for our numerical tests, we used a shear/compression test of the unit 
cube with perfectly plastic behavior (Fig. 7). 

Additionally, we simulated the well-documented necking of a circular bar as an 
example for exponential hardening behavior [7]. 

3.3.2 Software Framework UG 

UG ([8]) is a general-purpose library for the solution of partial differential equations, 
which supports parallel adaptive multigrid methods on high-performance computers. 
A novel implementation ensures the complete independence of grid and algebra. 
Cache-aware storage for algebra structures and a parallel communication layer make 
UG4 well suited for current and next-generation hardware architectures. 


4 Automatic Generation of Efficient Finite Element Codes 
4.1 Background 

Implementing finite element codes is a time-consuming task. It involves much work 
that is unrelated to the mathematical model to be realized and consists merely of 
implementation details. An actual implementation is a long, often confusing code in 
which the underlying mathematical model is difficult to recognize, which is also often a 
source of errors. It is therefore desirable to describe the FEM 
model in a more abstract, mathematical way, which is then automatically translated to 
code. The error-prone, typically repetitive nature of the code, with many similar 
cases to handle (e.g. for boundary conditions), makes the code for element matrix 
computation ideal for automatic code generation. The same applies to the sparse 
matrix operations involved, as the matrix format is typically adapted to the problem 
and the target hardware. The possibility to easily modify the FEM 
model is also desirable for the development of more robust models. Therefore, a code 
generator is being developed which allows an abstract definition of FEM models and 
aims to generate Fortran or C code from it that executes efficiently on several 
different hardware platforms. 

Fig. 7 (a) Unit cube in a shear/compression test, (b) max. stress at the midpoint of the cube, 
(c) 1/8th of a circular bar in a tensile test, and (d) change of necking area in time 


4.2 Abstract Model Description 

The code generator is controlled via an input file which contains the necessary 
information about the mathematical model. The most important components of a FEM 
model were identified and an input format was specified. The following information 
is required to generate code for the computation of element matrices: 

• Element geometries, node coordinates 
e.g. Quad, (0,0), (0,1), (1,0), (1,1) 

• Base for space of shape functions 
e.g. 1, x, y, xy 

• Differential operator to discretize 
e.g. ρv² (mass matrix) or the deformation gradient 

• Boundary conditions 
e.g. Dirichlet, Neumann or combination 

• Integration rules 
Gaussian quadrature, or similar weighted points rule 

The description of elements and integration rules is treated in a straightforward way. 
For example, the description of a first-order 2D quad element would be done as 
follows: 

    Element Quad {
        Type = Square;
        Nodes myNodeSet {
            Coordinates {
                {0,0},
                {1,0},
                {0,1},
                {1,1}
            }
            ShapeFunctions {
                Polynomial { 1 },
                Polynomial { x },
                Polynomial { y },
                Polynomial { x y },
            }
        }
        Coordinates = myNodeSet;
        Integration myRule {
            0.25 -> { 0.211325,0.211325 },
            0.25 -> { 0.788675,0.211325 },
            0.25 -> { 0.211325,0.788675 },
            0.25 -> { 0.788675,0.788675 },
        }
    }

It is possible to define multiple node sets and integration rules to treat different 
degrees of freedom differently. 
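
The integration rule in such an element description is the 2x2 Gauss-Legendre rule mapped to the unit square. As a cross-check, the following Python sketch derives points and weights with NumPy; the function name `gauss_rule_unit_square` is hypothetical and the snippet is independent of the code generator itself.

```python
import numpy as np

def gauss_rule_unit_square(n=2):
    """Tensor-product Gauss-Legendre rule on [0, 1] x [0, 1]."""
    x, w = np.polynomial.legendre.leggauss(n)   # nodes and weights on [-1, 1]
    x01 = 0.5 * (x + 1.0)                       # map the nodes to [0, 1]
    w01 = 0.5 * w                               # rescale the weights accordingly
    points = [(xi, yj) for yj in x01 for xi in x01]
    weights = [wi * wj for wj in w01 for wi in w01]
    return np.array(points), np.array(weights)

points, weights = gauss_rule_unit_square(2)
# Sanity check: the rule integrates x*y over the unit square exactly (value 1/4).
integral_xy = float(np.sum(weights * points[:, 0] * points[:, 1]))
```

For n = 2 the one-dimensional nodes are 1/2 ± 1/(2√3), i.e. approximately 0.211325 and 0.788675, and every tensor-product weight equals 0.25.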

The description of the operator to discretize is done in weak form, which is 
necessary to allow for the specification of partially integrated forms in which 
derivatives of the test functions appear. For example, the Poisson equation in 2 dimensions 

\[ (\partial_x^2 + \partial_y^2)\,\psi = \rho \]

would be treated in the partially integrated weak form 

\[ \int \left( \partial_x\phi\,\partial_x\psi + \partial_y\phi\,\partial_y\psi \right) d^2x = -\int \phi\,\rho \; d^2x \]

which can be expressed in the input file as 

    Matrix myExample {
        Elements { myQuad }
        Coefficients { myNodeSet:rho }
        Variables { myNodeSet:psi }
        TestFunctions { myNodeSet:phi }
        Operator leftside {
            diff(phi,x)*diff(psi,x)+diff(phi,y)*diff(psi,y)
        }
        Vector Operator rightSide {
            -phi*rho
        }
    }
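
For orientation, the element matrix that the left-hand side operator describes can be worked out by hand for the unit-square element. The Python sketch below does this with NumPy: it builds nodal shape functions from the basis 1, x, y, xy, differentiates them, and integrates with the 2x2 Gauss rule. It is an illustrative hand-written computation, not output of the code generator; since the element coincides with the unit square, the Jacobian is the identity and is omitted. All variable names are assumptions.

```python
import numpy as np

nodes = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def basis(x, y):
    """Polynomial basis 1, x, y, xy evaluated at (x, y)."""
    return np.array([1.0, x, y, x * y])

def basis_grad(x, y):
    """Row 0: d/dx of the basis functions, row 1: d/dy."""
    return np.array([[0.0, 1.0, 0.0, y],
                     [0.0, 0.0, 1.0, x]])

# Nodal shape functions N_i = sum_j coeff[j, i] * basis_j with N_i(node_k) = delta_ik.
vandermonde = np.array([basis(x, y) for x, y in nodes])
coeff = np.linalg.inv(vandermonde)

# 2x2 Gauss rule on the unit square.
g = 0.5 - 0.5 / np.sqrt(3.0)
gauss_points = [(g, g), (1.0 - g, g), (g, 1.0 - g), (1.0 - g, 1.0 - g)]
gauss_weights = [0.25] * 4

K = np.zeros((4, 4))
for (x, y), w in zip(gauss_points, gauss_weights):
    grad_N = basis_grad(x, y) @ coeff   # shape (2, 4): gradients of all N_i
    K += w * grad_N.T @ grad_N          # d(phi)/dx*d(psi)/dx + d(phi)/dy*d(psi)/dy
```

The resulting matrix is symmetric, its rows sum to zero (constant fields lie in the kernel of the Laplace operator), and the diagonal entries equal 2/3.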


4.3 Generated Code 

At the moment the generator can produce code for the following languages: 

• C/C++ 

• Fortran 90 

• CUDA C 

• CUDA Fortran 

• OpenCL 

The code generator chooses for each target platform an implementation variant 
which takes hardware details into account in order to produce efficient code for this 
platform. For example, the efficiency of sparse matrix storage formats depends strongly 
on the CPU and memory architecture, and different patterns of vectorization should be 
used depending on the SIMD length of the CPU. 
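
As an illustration of what a sparse matrix storage format looks like, the following Python sketch implements the common compressed sparse row (CSR) layout and the corresponding matrix-vector product. CSR is chosen here only as a familiar example; the text does not state which formats the generator actually emits, and the function names are hypothetical.

```python
import numpy as np

def dense_to_csr(a):
    """Convert a dense 2D array to CSR arrays (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in a:
        nz = np.flatnonzero(row)
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))      # start of the next row in `values`
    return (np.array(values, dtype=float),
            np.array(col_idx, dtype=int),
            np.array(row_ptr, dtype=int))

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = np.dot(values[start:end], x[col_idx[start:end]])
    return y

A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
x = np.array([1.0, 2.0, 3.0])
values, col_idx, row_ptr = dense_to_csr(A)
y = csr_matvec(values, col_idx, row_ptr, x)
```

The inner loop reads `x` through the indirection `col_idx`, so its memory access pattern depends on the sparsity structure. This is one reason why the best format and vectorization strategy differ between CPUs and accelerators.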


4.4 Current State 

At the moment the complete cycle of generation of element matrices is implemented 
and the generated code appears to produce correct results. The computation of nodal 
shape functions from the basis, shape function derivatives, Jacobians, the transformation 
of derivatives to the global coordinate system and the numerical integration of operators 
are done as intermediate steps. This already relieves the developer of a great deal of 
work. The mathematical description can be stated in compact form and a complex 
computation code is automatically produced from the description. For example, the 
complete input file of the Poisson equation example shown above is 45 lines long 
and results in 1,047 lines of Fortran code. 

At the moment work is in progress on the generation of sparse matrix 
assembly code. The aim is to produce efficient parallel code for both multicore CPUs 
using OpenMP and accelerators using CUDA or OpenCL. Another step planned for the 
near future is the handling of boundary conditions, which is 
of course important but currently not available. 




4.5 Results 

Benchmarks of the generated code have been run for the Poisson equation example 
on an Intel Sandy Bridge CPU, an NVIDIA Tesla C2050 and an AMD Radeon 6970. 
The performance on all three platforms is promising, as can be seen below, but there 
is likely room for further optimization: 



Platform        Hardware peak performance    Performance of generated code 
                (GFLOPS)                     (2D Poisson) (GFLOPS) 
Sandy Bridge    105                          ≈30 
Tesla C2050     500                          ≈200 
Radeon 6970     500                          ≈200 


The goal of creating a tool which can be used to create real-world FEM applications 
has at least partially been achieved, but work will continue on adding more useful 
features and tuning performance for the different target platforms. 


Characterization of Carrier Sense Multiple 
Access in Vehicular Propagation Channels 


J. Mittag and H. Hartenstein 


Abstract Wireless communication between vehicles is considered to be one 
of the building blocks for increasing the safety level offered by future 
intelligent transportation systems. While it sounds intuitively convincing that a 
periodic exchange of status information, e.g. the current position, speed and driving 
direction, may help to avoid dangerous traffic situations or driving maneuvers, it 
is not clear whether the envisioned communications system, i.e. IEEE 802.11p, is 
sufficiently reliable and robust. In particular, it is not clear whether the 
Carrier Sense Multiple Access (CSMA) mechanism employed at the medium 
access control (MAC) layer is able to coordinate concurrent access by multiple 
network nodes in a highly dynamic environment as intended. In this paper, we 
evaluate the performance of the CSMA-based coordination mechanism employed 
by IEEE 802.11p. The evaluation is based on a network simulation framework 
that emulates the signal processing steps of a transceiver and accurately models 
the multi-path propagation effects of the wireless vehicular radio channel. Due to 
this accuracy, the execution of such high-fidelity simulations is computationally 
very expensive and represents a prominent example of the discipline called 
Computational Science and Engineering (CSE). Based on the results of our evaluation, 
we come to the conclusion that CSMA is able to coordinate concurrent access in 
vehicular environments, even if fading radio propagation characteristics are present. 


1 Introduction 

Vehicle-to-vehicle (V2V) communication is required for numerous applications 
that aim to improve traffic safety. Through a periodic wireless exchange of status-related 
messages among neighboring vehicles, a mutual awareness is to be established. 
Through this awareness, vehicles are expected to detect potentially dangerous traffic 
situations in advance and are therefore able to avoid fatal driving maneuvers. In 
order to ensure a reliable operation of such active safety systems, the deployed 
wireless communications system has to be reliable and robust. In particular, the physical 
layer has to deal sufficiently well with fading radio propagation characteristics. 
Likewise, medium access control mechanisms have to coordinate concurrent access 
properly within a wide range of radio propagation conditions. Whether this 
is actually achieved is an open question that cannot be answered easily. 

To obtain a solid answer, we opt for a simulation-based evaluation that considers 
the primary and relevant influencing factors, in particular a valid representation of 
the channel effects. In [2, 5] we therefore presented an extension of the open 
source NS-3 network simulator [6], in particular the addition of a signal-level 
implementation of the IEEE 802.11p physical layer (PHY) and medium access 
control (MAC) specifications, which are envisioned to be used in a first generation 
of vehicle-to-vehicle communication systems. The proposed modeling approach 
replaces simplified models that had been used prior to our proposal and is accurate 
enough to evaluate whether the IEEE 802.11p based PHY and MAC are sufficiently 
robust against the fading exhibited in vehicular radio propagation channels. Since 
the highly data-intensive signal processing leads to significant runtime 
slowdowns, we evaluated the benefit of parallel processing capabilities in [1], with 
the conclusion that noticeable speedups are only achievable in many-core system 
architectures with more than a few hundred compute units, for instance when using 
a GPU-based simulation architecture. 

In this report, we present the results of a simulation campaign that addresses the 
question whether the IEEE 802.11p based MAC layer is robust with respect to the 
fading characteristics typically present in vehicular radio channels. In particular, 
we characterize the coordination performance of Carrier Sense Multiple Access 
(CSMA), which is the fundamental MAC mechanism employed by IEEE 802.11p, 
over a wide range of scenario configurations and radio propagation conditions. With 
CSMA, each network node listens to the channel prior to its own transmission, and 
starts a transmission only if the channel is not considered busy. If the channel is 
busy, the transmission is deferred until the channel becomes idle again. Since 
we have no access to a GPU-based compute cluster (on which we could exploit 
our findings presented in [1] and on which the runtime performance of a single 
simulation experiment would be “optimal”), we use the simplest parallelization 
method: batch processing based on the parallel job scheduling capabilities offered 
by the HP XC4000 at the Steinbuch Centre for Computing. 

The rest of this paper is structured as follows: Sect. 2 presents the evaluation 
methodology and introduces the performance metrics that were used to characterize 
the coordination efficiency of CSMA. Section 3 then presents a selected subset of 
the obtained results to illustrate the outcome of the evaluation and provide an answer 
to the question whether CSMA is able to coordinate concurrent access sufficiently 
well in fading environments. Section 4 finally concludes this paper with a brief 
summary. 


2 Evaluation Methodology 

The characterization of CSMA is performed considering a highway environment 
with vehicles being placed uniformly on a 5 km long road with 2 lanes per 
direction. The highway environment is chosen since considerably high velocities, 
and hence pronounced time- and frequency-selective propagation characteristics, 
can be expected in this setting. A simple broadcast application that is running on 
each vehicle generates periodic awareness messages at an average rate r (in Hz). 
In order to introduce some randomness, a small jitter is applied to the interval 
between two subsequent awareness messages. 

To evaluate different network saturation levels, application-specific impact 
factors are varied, i.e. the packet generation rate r is set to either 2, 5 or 10 Hz, 
the transmission power is set to either 5, 10, 15 or 20 dBm, and the size of an 
awareness message is set to 200 or 400 bytes. Furthermore, three different average 
vehicle densities are considered to vary the number of transceivers for which 
concurrent access has to be coordinated: 40, 80, and 120 vehicles/km. Although 
mobility is considered in order to simulate fast-fading channel conditions, vehicles 
are configured to keep their (initial) positions. Since CSMA does not employ any 
slot reservation technique, and vehicles would not alter their positions significantly 
during a few milliseconds (with respect to the dimension of the network in terms 
of communication range), the topology of the network can be considered stationary 
during the channel contention period. This configuration should therefore not affect 
the relevance of the obtained results. Nevertheless, in order to enable an application 
of channel fading models, a (fake) average mobility of 100 km/h is considered by 
the radio propagation models. 

With respect to IEEE 802.11p medium access control, a basic Distributed 
Coordination Function (DCF) with a Clear Channel Assessment (CCA) threshold 
of -91 dBm, a fixed contention window size of 15 slots, and a slot time of 13 µs is 
used. Further, each vehicle is configured to use a data rate of 6 Mbps in a 10 MHz 
channel at a carrier frequency of 5.9 GHz. An overview of additional relevant 
configuration parameters is given in Table 1. 
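
The contention parameters allow a rough estimate of how often backoff timers expire simultaneously. The following Python sketch assumes, as in the standard IEEE 802.11 DCF, that each node draws its backoff counter uniformly from the integers 0 to CW, and that all contenders start their countdown at the same instant. Both are simplifying assumptions of this illustration; the function name is hypothetical and the numbers are not simulation results.

```python
from fractions import Fraction

SLOT_TIME_US = 13   # slot time from the text, in microseconds
CW = 15             # contention window size from the text, in slots

def same_slot_probability(cw, contenders):
    """Probability that at least one of `contenders` other nodes draws the same
    backoff value as the reference node (independent uniform draws from 0..cw)."""
    p_differ = Fraction(cw, cw + 1)
    return 1 - p_differ ** contenders

p_two_nodes = same_slot_probability(CW, 1)    # one competing node
p_four_nodes = same_slot_probability(CW, 3)   # three competing nodes
max_backoff_us = CW * SLOT_TIME_US            # longest possible backoff period
```

With one competing node the probability is 1/16, and it grows quickly with the number of contenders, which is why simultaneous transmissions cannot be ruled out even when all nodes sense each other perfectly.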

Table 1 Application layer, medium access control and physical layer 
parameters and the settings used for the packet collision characterization 

Parameter                               Value 
Application layer 
  Packet size                           200, 400 bytes 
  Transmission rate                     2, 5, 10 Hz 
  Transmission power                    5, 10, 15, 20 dBm 
Medium access control layer 
  Slot time                             13 µs 
  Contention window size                15 slots 
  CCA busy threshold                    -91 dBm 
Physical layer 
  Modulation scheme                     QPSK 
  Coding rate                           1/2 
  Channel bandwidth                     10 MHz 
  Carrier frequency                     5.9 GHz 
  Tx center frequency offset tolerance  20 ppm 
  Capture threshold                     8 dB 
  Noise level                           -99 dBm 

Most importantly, the impact of three different radio propagation characteristics 
is evaluated. Initially, only a distance-dependent deterministic path loss is considered 
to study the coordination performance in the absence of any channel fading 
characteristics. Such a consideration enables the identification of the fundamental CSMA 
weaknesses, and serves as a reference when analyzing the results of subsequent 
simulations in which fading is considered. As proposed by Kunisch et al. in [3], only 
a power law model with a reference loss of 59.7 dB and a path loss exponent of 1.85 
is used. Then, large-scale fading characteristics that follow a Normal distribution 
with σ = 3.2 dB are introduced. In a last step, the effect of small-scale fading is 
analyzed by simulating a Rayleigh fading channel using the Jakes Doppler 
spectrum (instead of the Normal shadowing). 
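
The first two channel configurations can be written down in a few lines. The Python sketch below implements a power-law path loss with the stated reference loss and exponent, optional Normal shadowing in dB, and the carrier sense decision against the CCA threshold. The reference distance of 1 m and the absence of antenna gains are assumptions of this sketch (they are not given in the text), so absolute ranges computed with it are not expected to match the simulation results. Rayleigh fading is not covered.

```python
import math
import random

REFERENCE_LOSS_DB = 59.7     # from the text
PATH_LOSS_EXPONENT = 1.85    # from the text
SHADOWING_SIGMA_DB = 3.2     # from the text
REFERENCE_DISTANCE_M = 1.0   # assumption, not stated in the text

def path_loss_db(distance_m):
    """Deterministic power-law path loss in dB."""
    ratio = distance_m / REFERENCE_DISTANCE_M
    return REFERENCE_LOSS_DB + 10.0 * PATH_LOSS_EXPONENT * math.log10(ratio)

def received_power_dbm(tx_power_dbm, distance_m, rng=None):
    """Received power; adds Normal shadowing (in dB) if a random generator is given."""
    shadowing_db = rng.gauss(0.0, SHADOWING_SIGMA_DB) if rng is not None else 0.0
    return tx_power_dbm - path_loss_db(distance_m) + shadowing_db

def channel_sensed_busy(rx_power_dbm, cca_threshold_dbm=-91.0):
    """Carrier sense decision: busy if the received power reaches the CCA threshold."""
    return rx_power_dbm >= cca_threshold_dbm

deterministic = received_power_dbm(20.0, 100.0)
shadowed = received_power_dbm(20.0, 100.0, rng=random.Random(1))
```

With shadowing enabled, a node at a fixed distance may fall below or above the CCA threshold from one packet to the next, which is the mechanism behind the additional incoordination discussed in the results.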

During each simulation run, the “lifetime” of all packets transmitted by 20 
selected reference nodes (which are located in the center of the scenario to avoid 
border effects) is monitored and evaluated with respect to different performance 
metrics. The applied metrics are classified with respect to receiver and transmitter 
perspective. 

From the perspective of a transmitting (reference) node, the primary metric that 
describes the coordination efficiency of a MAC protocol is the probability that one 
or multiple other nodes are incoordinated with its own transmission. 

Definition 1 (Packet Level Incoordination, PLI). The packet level incoordination, 
as observed from the perspective of a node r and one of its generated packets p, 
describes the probability that at least one node s, s ≠ r, transmitted a packet q during 
the transmission period of p. 

Apart from the quantification of the PLI, it is also important to resolve the type 
of incoordination. With respect to CSMA, the reason for an incoordination could be 
that either both nodes started their transmission at exactly the same point in time, 
e.g. due to simultaneous expiration of backoff timers, or that the incoordinated node 
did not sense the reference transmission, e.g. due to shadowing or fading channel 
characteristics. To distinguish between both cases, the time difference between both 
transmissions has to be evaluated, which leads to the definition of the Incoordination 
Delay Profile. 







Definition 2 (Incoordination Delay Profile, IDP). The incoordination delay profile 
describes the probability distribution of the starting time differences between a 
set of packet transmissions $P = \{p_1, \ldots, p_n\}$ and each packet's corresponding set 
of incoordinated transmissions $Q_i = \{q_{i1}, \ldots, q_{ij}\}$, with $1 \le i \le n$, and j being 
the number of packets interfering with $p_i$. 

In case of CSMA and distance decaying deterministic channel conditions, the 
IDP should indicate that all incoordinated nodes located within the carrier sense 
range - the range within which the received signal will exceed the CCA busy 
threshold - transmit more or less simultaneously. Only incoordinated nodes outside 
the carrier sense range should show significantly greater delays. To determine the 
effectiveness of CSMA with respect to this controlled spatial reuse of the channel, the 
IDP is evaluated in the following with respect to the distance between sender and 
incoordinated node. 
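
Both metrics can be estimated from a log of transmissions. The Python sketch below is a minimal illustration under stated assumptions: a transmission is described by node id, start time and duration (the class `Transmission` and all function names are hypothetical), two transmissions are incoordinated if their time intervals overlap, and the delay is taken as the absolute difference of the start times. The distance dimension of the IDP is left out for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transmission:
    node: int
    start_us: float
    duration_us: float

    @property
    def end_us(self):
        return self.start_us + self.duration_us

def incoordinated(p, others):
    """Transmissions of other nodes that overlap in time with packet p."""
    return [q for q in others
            if q.node != p.node and q.start_us < p.end_us and p.start_us < q.end_us]

def packet_level_incoordination(reference_packets, all_transmissions):
    """Fraction of reference packets with at least one overlapping transmission."""
    hit = sum(1 for p in reference_packets if incoordinated(p, all_transmissions))
    return hit / len(reference_packets)

def incoordination_delays(reference_packets, all_transmissions):
    """Absolute start-time differences; their histogram estimates the IDP."""
    return [abs(q.start_us - p.start_us)
            for p in reference_packets
            for q in incoordinated(p, all_transmissions)]

# Tiny hand-made log: node 0 is the reference node.
log = [Transmission(0, 0.0, 576.0), Transmission(1, 0.0, 576.0),
       Transmission(2, 300.0, 576.0), Transmission(0, 5000.0, 576.0)]
refs = [t for t in log if t.node == 0]
pli = packet_level_incoordination(refs, log)
delays = incoordination_delays(refs, log)
```

In this toy log, the first reference packet overlaps with a simultaneous transmission (delay 0) and with one that starts 300 µs later, while the second reference packet is undisturbed.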


3 Results 

The following section presents and discusses a representative subset of the obtained 
results. This subset is sufficient in order to illustrate the fundamental findings and 
to demonstrate the conclusions that are eventually drawn. In particular, due to 
space restrictions, only the 80 vehicles/km scenario with a 400-byte packet size 
configuration is considered as a reference in the following. Readers who wish 
to examine the complete set of results can access the results as well 
as the full source code of all experiments online [4]. 

Figure 1 illustrates the observed packet level incoordination (with respect to the 
range in which incoordination is considered) using three different packet generation 
rates, the selected vehicle density of 80 vehicles/km, and three different channel 
configurations, i.e. a setup with deterministic path loss only, a setup with additional 
Normal shadowing, and a setup with additional Rayleigh fading. As illustrated, 
almost no incoordination is observed within a range of approx. 700 m for most 
scenario configurations that exhibit only a deterministic path loss, cf. Fig. 1a. This 
is an expected result, since the range of 700 m reflects the area within which the 
received signal strength stays above the configured -91 dBm CCA busy threshold. 
CSMA can therefore be certified to achieve its design goal in such an environment 
and setup. 

If large-scale fading based on Normal shadowing is assumed, the corresponding 
PLI curves are slightly increased in comparison to the deterministic path loss 
setup, cf. Fig. 1b. Particularly for ranges slightly below and above the deterministic 
carrier sense range of 700 m, a noticeable increase of the PLI can be observed. This 
is also expected, since the large-scale fading leads to situations in which a vehicle 
located within the deterministic carrier sense range experiences received signal 
strengths below the -91 dBm CCA busy threshold; hence these vehicles will not 
block their MAC layer and might interfere with one of the reference transmissions. 
This phenomenon is more pronounced if a Rayleigh fading channel is considered, cf. 
Fig. 1c. Whether this small increase is still sufficiently low cannot be answered 
at this point, since no statement about the resulting packet delivery ratios is made. 
In general terms, an additional evaluation of the MAC layer performance from a 
receiver perspective is required in order to answer the question whether a small 
increase of the PLI is significant or not. 

Fig. 1 Probability of PLI with respect to the range within which incoordination 
is considered for a 20 dBm transmit power, different channel configurations, and a 
2, 5 or 10 Hz setup. (a) Distance decaying path loss, (b) Normal shadowing, 
(c) Rayleigh fading. Horizontal axis: considered range [m] 

As shown in Fig. 1a, a small number of incoordinated transmissions is observed 
within the deterministic carrier sense range despite the absence of any signal fading. 
According to the design of CSMA, such an incoordination should only occur if 
the transmitting vehicle (i.e. the reference node) and the interfering vehicle start to 
transmit at (more or less) the same point in time. In such a situation, the interfering 
node is inherently not able to sense the signal of the reference node prior to its own 
transmission. In order to evaluate this aspect, the observed IDPs with respect to the 
distance between the reference node and a potentially interfering node are discussed 
in the following. Since the benefit of showing results for multiple scenarios is 
marginal, only the setup of an 80 vehicles/km scenario with a 20 dBm transmit power, 
a 10 Hz packet generation rate and a packet size of 400 bytes is considered. 

Fig. 2 Observed incoordination delay profiles with respect to the distance between reference 
and incoordinated node for an 80 vehicles/km scenario, non-fading channel conditions, a 20 dBm 
transmit power and a 10 Hz packet generation rate 

Figure 2 illustrates the observed IDPs with respect to the distance between a 
reference node and interfering nodes for the deterministic channel configuration 
and a 10 Hz packet generation rate setup. Since the length of a 400 byte packet is 
equal to 576 µs in the time domain, the time difference between the transmission
of a reference node and the transmission of an incoordinated node can be at most
576 µs. This maximum value is however not observed for interferers located within
a range of 700 m. Indeed, as expected in this setup, incoordination from vehicles 
located within the carrier sense range occurs only if interfering vehicles transmit 
exactly at the same point in time as the reference node. Hence, CSMA can again be 
certified to fulfill its objective. Please note that a more or less uniform distribution of 
the incoordination delay is observed for all distances greater than the carrier sense 
range. 

Unsurprisingly, the situation changes if fading characteristics are introduced. As
shown in Fig. 3a, which plots the IDPs obtained in a Normal shadowing configuration,
the range within which incoordination occurs only due to identical transmission
times decreases to approx. 300 m. For greater distances, the IDP tends towards a
uniform distribution of the incoordination delays. A similar but more intense effect
can be observed in the Rayleigh fading channel configuration, cf. Fig. 3b. In such an
environment, CSMA has difficulty avoiding incoordination that
is not caused by identical transmission times. Recall that incoordination is still less
likely to occur at such close distances (in comparison to remote distances). Only
the time domain characteristic of incoordinated transmissions at close distances has
changed, and is no longer fundamentally different from that of incoordinated
transmissions at remote distances (as it was in the deterministic path-loss-only
channel configuration).

Fig. 3 Observed incoordination delay profiles with respect to the distance between reference and
incoordinated node for an 80 vehicles/km scenario, fading channel conditions, a 20 dBm transmit
power and a 10 Hz packet generation rate, (a) Normal shadowing, (b) Rayleigh fading





























4 Conclusions 

In this paper, we studied the coordination performance of Carrier Sense Multiple 
Access (CSMA) in vehicular environments, in particular under the influence of 
fading radio propagation conditions. We defined two metrics which characterize 
CSMA’s ability to suppress concurrent and interfering packet transmissions by 
nodes located within each other’s close surrounding. We further employed a high 
fidelity network simulation framework to capture and accurately model the impact 
of a multi-path radio propagation channel and the influence of signal processing 
algorithms at the physical layer. Since such an accurate modeling approach is
computationally very expensive, we made use of the HP XC4000
operated by the Steinbuch Centre for Computing (SCC) at the Karlsruhe Institute
of Technology (KIT). According to the obtained results, CSMA can be certified
to fulfill its design objectives, in the sense that it is able to effectively suppress
concurrent packet transmissions by nodes located within each other's carrier sensing
range. If fading propagation conditions are present, the effectiveness of CSMA is
slightly reduced. In order to answer whether this reduction has a significant impact 
on the packet reception performance or not, we plan to evaluate the performance of 
CSMA from the perspective of a receiver in a follow-up work. 
Abstract With nearly one billion online videos viewed every day, an emerging new
frontier in computer vision research is recognition and search in video. While much
effort has been devoted to the collection and annotation of large-scale static
image datasets containing thousands of image categories, human action datasets
lag far behind. Current action recognition databases contain on the order of ten
different action categories collected under fairly controlled conditions.
State-of-the-art performance on these datasets is now near ceiling and thus there is a
need for the design and creation of new benchmarks. To address this issue we collected the
largest action video database to date with 51 action categories, which in total contain
around 7,000 manually annotated clips extracted from a variety of sources ranging 
from digitized movies to YouTube. The goal of this effort is to provide a tool to 
evaluate the performance of computer vision systems for action recognition and 
explore the robustness of these methods under various conditions such as camera 
motion, viewpoint, video quality and occlusion. 
1 Introduction

We attempt to advance the field with the design and collection of a large video 
database dubbed Human Motion DataBase (HMDB) that tries to capture the 
richness and complexity of human actions. With 51 distinct action categories, each
containing at least 100 clips, for a total of just under 7,000 video clips extracted
from a variety of sources, the proposed database is, to our knowledge, the largest
and perhaps the most realistic available database. Each clip of the database was
validated by at least two human observers to ensure consistency. In addition to 
category labels, meta-data was added to facilitate further selection, pre-processing 
and training of recognition systems (i.e., number of actors involved in the action, 
view-point, approximate viewing-distance, presence/absence of camera motion, 
video quality, etc.). 

First, we use this database to evaluate the performance of two representative
computer vision systems. We considered the biologically-motivated action recognition
system by Jhuang et al. [7], which is based on a model of the dorsal
stream of the visual cortex and was recently shown to achieve human-level
performance for the recognition of rodent behaviors in the constrained setting of the
homecage environment [6]. We also considered the popular spatio-temporal bag-of-words
system by Laptev and colleagues [9, 10, 15]. We compare the performance
of these state-of-the-art systems on the HMDB database, evaluate their robustness
to various sources of image degradation (camera motion, occlusion and changes
in view-point) and discuss the relative role of shape vs. motion information for
action recognition.


2 Related Work 

With several billions of videos currently available on the internet and about 24 h 
of video uploaded to YouTube every minute, there is an immediate need for robust 
algorithms that could help organize, summarize and retrieve this massive amount of 
data. While much effort has been devoted to the collection of realistic internet-scale
static image databases [4, 5, 11, 13, 14, 18], current action recognition
datasets lag far behind. The three most popular benchmark databases (i.e., the KTH
[1], Weizmann [3] and IXMAS [16] datasets) contain around 6-11 actions each.
These databases are not quite representative of the richness and complexity of
real-world action videos, as they are fairly well constrained in terms of illumination
and camera position. A typical video clip contains a single (staged) actor with no
occlusion and very limited clutter. Recognition rates on these datasets tend to be
very high. For instance, a recent survey comparing action recognition systems
[17] reported that 12 out of 21 systems tested perform better than 90 % on the
KTH dataset. For the Weizmann dataset, 14 out of 16 tested systems perform at
90 % or better, 8 out of 16 better than 95 % and 3 out of 16 scored a perfect 100 %
recognition rate.
Fig. 1 Comparison between current action recognition datasets

Recent action datasets (Hollywood and UCF50) try to be more realistic
by considering video clips taken from real movies and
YouTube. These datasets are more challenging due to large variations in camera
motion, object appearance, pose, scale and viewpoint as well as cluttered background,
etc. The UCF50 database (available at http://server.cs.ucf.edu/~vision/data/
UCF50.rar) extends the 11 original action categories from the YouTube action
dataset and consists of 50 action categories with realistic videos taken from YouTube.
For all 50 categories, the videos are grouped into 25 groups, where each group
consists of more than 4 action clips.

The UCF50, its predecessor UCF Sports, and a recent Olympic sports dataset
[19] contain mostly sport videos from YouTube. These types of action are relatively
unambiguous (as a result of searching for specific titles on YouTube), and are highly
distinguishable from shape cues alone (such as the raw positions of the joints or
the silhouette extracted from single frames). To demonstrate this point, we conducted
a simple experiment: We manually annotated stick-figures (i.e., 9 line segments
corresponding to the two upper and lower arms, the two upper and lower legs and
the body trunk) from 5 randomly selected clips from each of the 13 action categories
in the UCF Sports dataset. Using a leave-one-clip-out procedure, classifying the raw
joint locations from single frames led to a recognition rate above 98 % (chance level
8 %). We conducted a very similar experiment on the proposed HMDB database
where we drew from 10 action categories similar to those used in the UCF (e.g.,
climb, climb-stairs, run, walk, jump, etc.) and manually annotated the joint locations
for a set of over 1,100 random clips (see Sect. 4 for details). The accuracy of
a classifier using the joint locations as inputs reached only 35 % this time (chance
level 10 %) and was below the level of performance of the same classifier
using motion features [7]. Such a dataset may thus be a better indicator of the relative
contributions of motion vs. shape cues for the recognition of real-world actions (see
Sect. 4). Figure 1 shows a comparison between existing action recognition datasets.
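
The leave-one-clip-out procedure used in this experiment can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which classifier was applied to the joint locations, so a 1-nearest-neighbour rule is used here, and the function names, the 18-number frame descriptor and the toy data are all hypothetical.

```python
import numpy as np


def nearest_neighbor_predict(train_x, train_y, test_x):
    """Label each test frame with the label of its closest training frame."""
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[np.argmin(d2, axis=1)]


def leave_one_clip_out_accuracy(frames, labels, clip_ids):
    """Hold out all frames of one clip at a time and classify them per frame."""
    correct = 0
    for clip in np.unique(clip_ids):
        held_out = clip_ids == clip
        pred = nearest_neighbor_predict(
            frames[~held_out], labels[~held_out], frames[held_out]
        )
        correct += int((pred == labels[held_out]).sum())
    return correct / len(labels)


# Toy data (hypothetical): 2 action classes, 3 clips per class, 4 frames per
# clip, each frame described by 18 numbers standing in for joint coordinates.
rng = np.random.default_rng(0)
frames, labels, clip_ids = [], [], []
for clip in range(6):
    label = clip % 2
    center = np.full(18, 5.0 * label)  # well-separated class prototypes
    frames.append(center + 0.1 * rng.standard_normal((4, 18)))
    labels += [label] * 4
    clip_ids += [clip] * 4
frames = np.vstack(frames)
labels = np.array(labels)
clip_ids = np.array(clip_ids)

accuracy = leave_one_clip_out_accuracy(frames, labels, clip_ids)
```

Holding out whole clips rather than single frames matters because frames of the same clip are nearly identical; a frame-level split would let the classifier match a test frame against its own neighbours in time.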
3 Design of the Dataset

It has been estimated that there are over 1,000 human action categories. In order 
to isolate human actions that are representative of everyday actions, we first asked 
a group of students to watch videos using a subtitle annotation tool to annotate 
any segment of these videos that they deemed to represent a single non-ambiguous 
human action. Students were asked to consider a minimum quality standard (i.e., a 
single action per clip, a minimum of 60 pixels in height for the main actor, minimum 
contrast level, minimum action length of about 1 s and acceptable compression 
artifacts). Students considered videos from three sources: digitized films available 
on the internet, public databases such as the Prelinger archive as well as YouTube 
and Google videos. A first set of annotations was thus generated in this way with 
over 60 action categories. To further guarantee that we would be able to populate
all action categories with at least 101 different video clips, we considered the top 51
action categories and asked students to specifically look for these types of actions.

This database, which we call HMDB51, comprises 51 distinct human action
categories, with 6,474 clips overall from 1,697 unique source videos. We also
collected meta-data towards a precise evaluation of the limitation of current 
computer vision systems. Each clip annotation contains a field indicating the visible 
body parts (head, upper body, lower body, full body), whether the camera is moving 
or static, how many people are involved in performing the action (single action vs. 
two-people vs. multiple people actions), as well as the camera orientation relative to 
the actor (front, back, left or right). The clips were also annotated according to their 
video quality from (a) good (i.e., detailed visual elements such as the fingers and 
eyes of the main actor are identifiable through most of the clip with limited motion 
blur and compression artifacts), (b) medium (i.e., larger body parts like the upper 
and lower arms and legs are identifiable through most of the clip) and (c) fair (i.e., 
even larger body parts are not identifiable due in part to the presence of motion blur 
and compression artifacts). The distribution of labels in HMDB51 is as follows:

• Clips with camera motion - 59.9 % vs. static camera - 40.1 %

• Camera position: front - 40.8 %, back - 18.2 %, left - 22.1 %, right - 19.0 %

• Clip quality: good - 17.1 %, medium - 62.1 %, fair - 20.8 %

• Visible body part: full body - 56.3 %, head - 12.3 %, lower body - 0.8 %, upper
body - 30.5 %


3.1 Preprocessing Steps 

Normalization 

The original video sources used to extract the action video clips varied in size (from
176 x 132 to 1,280 x 720) as well as in frame rate (6-60 fps). To ensure consistency, we
thus resized all extracted clips to a height of 240 pixels (using bicubic interpolation
over a 4 x 4 neighborhood). The width of the clips was scaled accordingly so as to
maintain the original aspect ratio. We further normalized all video frame rates (by
either dropping or duplicating frames) to ensure a fixed 30 fps frame rate over all
clips. This was done using the ffmpeg video library. All clips were then compressed
using the DivX 5.0 codec with the .avi output format.
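
The arithmetic behind this normalization can be illustrated with a short sketch. It is not the ffmpeg processing the authors ran; it only shows, under our own assumptions about rounding, how an output size with a fixed height and a drop-or-duplicate frame mapping could be computed. The function names are hypothetical.

```python
def scaled_size(width, height, target_height=240):
    """Output (width, height) for a fixed target height, keeping the aspect ratio."""
    return round(width * target_height / height), target_height


def resample_frame_indices(n_frames, src_fps, dst_fps=30.0):
    """Source frame index to show for each output frame.

    Frames are skipped when src_fps > dst_fps and repeated when
    src_fps < dst_fps, so the clip duration stays the same.
    """
    n_out = int(round(n_frames * dst_fps / src_fps))
    return [min(int(i * src_fps / dst_fps), n_frames - 1) for i in range(n_out)]


# The two extreme source sizes mentioned in the text.
small = scaled_size(176, 132)
large = scaled_size(1280, 720)

# A 15 fps clip has every frame duplicated; a 60 fps clip keeps every other frame.
duplicated = resample_frame_indices(4, 15.0)
dropped = resample_frame_indices(6, 60.0)
```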


Video Stabilization Using Point Feature Matching 

One significant challenge associated with the use of video clips extracted from real- 
world videos is the potential presence of significant camera/background motion 
(about 2/3 of the clips in our database, see above). Such camera motion can interfere 
with the local motion computation (see Results) and should potentially be corrected. 
Stabilizing these videos is thus a key pre-processing step. A simple way to correct 
for camera motion is to apply a basic image stitching algorithm to align successive 
frames according to the camera motion. A background plane is estimated by first 
detecting and then matching salient features (using the Harris corner detector) 
between adjacent frames. Correspondences are computed using a distance measure
that includes both the absolute pixel differences within a 15 x 15 image patch
centered on the corner point and the Euclidean distance of the corner points. Corner
points with the minimum distance are then matched and the RANSAC algorithm is
used (50 % inlier with 95 % confidence) to estimate the geometric transformation
between all neighboring frames from these noisy correspondences. This is done
independently for every pair of frames. After smoothing, a cumulative image
transformation is computed and the movie frames are warped using this
estimate to achieve a stabilized video.
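
The RANSAC stage of this pipeline can be sketched as follows. Several points are assumptions on our part rather than details given in the text: the transformation is modelled as a 2D affine map, the "50 % inlier with 95 % confidence" setting is interpreted as the usual rule for choosing the number of random draws, and the threshold, function names and toy correspondences are hypothetical. Corner detection, patch matching and the temporal smoothing are not shown.

```python
import math

import numpy as np


def ransac_iterations(inlier_ratio=0.5, confidence=0.95, sample_size=3):
    """Draws needed so that, with the given confidence, one sample is all inliers."""
    p_good_sample = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))


def fit_affine(src, dst):
    """Least-squares affine map as a 3x2 matrix M with [x, y, 1] @ M ~= dst."""
    design = np.hstack([src, np.ones((len(src), 1))])
    matrix, *_ = np.linalg.lstsq(design, dst, rcond=None)
    return matrix


def apply_affine(matrix, pts):
    """Map 2D points with the 3x2 affine matrix."""
    return np.hstack([pts, np.ones((len(pts), 1))]) @ matrix


def ransac_affine(src, dst, threshold=1.0, seed=0):
    """Estimate an affine transform from correspondences that contain outliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(ransac_iterations(sample_size=3)):
        sample = rng.choice(len(src), size=3, replace=False)
        candidate = fit_affine(src[sample], dst[sample])
        errors = np.linalg.norm(apply_affine(candidate, src) - dst, axis=1)
        inliers = errors < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on every correspondence that agrees with the best candidate.
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers


# Toy correspondences (hypothetical): 40 corner points, 4 of them mismatched.
rng = np.random.default_rng(1)
src = rng.uniform(0.0, 100.0, size=(40, 2))
true_matrix = np.array([[0.99, 0.05], [-0.05, 0.99], [3.0, -2.0]])
dst = apply_affine(true_matrix, src)
dst[:4] += np.array([30.0, -20.0])  # gross mismatches

estimated, inliers = ransac_affine(src, dst)
```

Under this reading, a minimal sample of three correspondences and a 50 % inlier ratio give 23 draws per frame pair; a model needing four correspondences would need more.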


Training and Test Splits Generation 

For evaluation purposes, three distinct training and test splits were generated from
the database. These splits were generated so as to ensure that (1) the same video
source could not be used for both training and testing and that (2) the relative
distribution of camera positions (view-point) as well as the proportion of clips
with/without camera motion, the video quality and visible body parts would be
balanced across the training and test sets. To do this, we implemented a very simple
constraint satisfaction approach to select from a large number of completely random
splits. We first picked the one best split according to these constraints (split_1) and
then the second and third best splits that would be least correlated to this first split
(see Table 1 for a measure of overlap between the splits as measured by a normalized
Hamming distance).
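
A normalized Hamming distance between two splits can be computed as in the following sketch. The encoding is an assumption: each split is represented here as one label per clip (1 for training, 0 for test), which the text does not specify, and the example assignments are made up.

```python
def normalized_hamming(split_a, split_b):
    """Fraction of clips whose train/test assignment differs between two splits."""
    if len(split_a) != len(split_b):
        raise ValueError("both splits must cover the same clips")
    differing = sum(a != b for a, b in zip(split_a, split_b))
    return differing / len(split_a)


# Hypothetical assignments for eight clips: 1 = training, 0 = test.
split_1 = [1, 1, 1, 0, 0, 1, 0, 1]
split_2 = [1, 0, 1, 0, 1, 1, 0, 0]

distance = normalized_hamming(split_1, split_2)  # 3 of 8 clips differ
```

A value of 0 means two splits are identical and larger values mean less overlap.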
Table 1 Normalized Hamming distance between the three training and test splits generated so as
to minimize the number of video sources present in both the training and test set and to maintain
similar statistics across view-points, amount of camera motion, etc.

(s1, s2)    (s1, s3)    (s2, s3)    Random splits
0.34        0.33        0.33        0.16 ± 0.10


4 Evaluation 

4.1 Benchmark Systems 

4.1.1 Biologically-Motivated Action Recognition System 

Jhuang et al. have described a computational model of the dorsal stream for the
recognition of actions [7]. The model starts with spatio-temporal filters modeled
after motion-sensitive cells in the primary visual cortex [12]. Just like the
V1-like simple units in the model of the ventral stream described above, these units
are tuned to specific orientations. As opposed to those in the model of the ventral
stream, which respond best to static stimuli, the V1-like simple units in the model
of the dorsal stream respond best to a bar moving in a direction orthogonal to its
preferred orientation. It has been suggested that motion-direction sensitive cells and 
static cells constitute two channels of processing, the former projecting to the dorsal 
stream and the latter to the ventral stream. During an unsupervised developmental- 
like learning stage, units in intermediate stages of the model become tuned to 
optic-flow patterns that appear with high probability in natural sequences of images. 
These optic-flow pattern units correspond to the combination of several complex cell 
receptive fields (tuned to different directions of motion instead of spatial orientations 
in the model of the ventral stream). Here we obtained code directly from the website 
of the authors. 


4.1.2 Spatio-Temporal Bag-of-Features (ST-BoF) 

Local space-time features have recently become a popular video representation 
for action recognition. Much like their static local spatial features counterpart for 
the recognition of objects and scenes, they have been shown to achieve state- 
of-the-art performance on several standard action recognition databases [10, 15]. 
An extensive comparison between existing methods (feature detectors and local 
descriptors) for the computation of space-time features in a common experimental 
setup was described in [15]. 

We implemented a system based on one of the most commonly used system
configurations, using a combination of the Harris3D detector and the HOGHOF
descriptors. For every clip we detected 3D Harris corners and computed the
combination of histograms of oriented gradients (HOG) and oriented flow (HOF)
as local descriptors. To build the codebook, between 75,000 and 250,000 space-time
interest point descriptors were sampled from the dataset and k-means clustering
with k = 2,000-10,000 was applied to this sample set. The space-time interest
point descriptors of every clip are matched to the nearest prototype as returned
by k-means clustering, and a histogram is built over the mapping. This leads
to a k-dimensional feature vector where k is the number of clusters used for
k-means. This feature vector is then used as input to an SVM classifier. For the
classification, a regularized support vector machine is used with an RBF kernel
$K(u, v) = \exp(-\gamma \, \|u - v\|^2)$. The parameters of the RBF kernel (cost term $c$ and
$\gamma$) were optimized using a greedy search with five-fold cross-validation on the
training set over $c = 2^{-5}, 2^{-3}, \ldots, 2^{25}$ and $\gamma = 2^{-15}, 2^{-13}, \ldots, 2^{15}$.


4.2 Evaluation of the Two Systems Performance 

We first evaluated the overall performance of both systems on the HMDB as well as
on the UCF50 dataset. On the HMDB, both systems show a similar mean recognition
performance of around 20 %, with the ST-BoF system performing about 2 %
lower than the system from Jhuang et al. [7] (Fig. 2). On the UCF50
dataset, the recognition performance of both systems is again very close. Both reach
about 45 %, but in this case the ST-BoF system is
2.4 % better than the system of Jhuang et al. [7]. Neither system shows a
clear superiority over the other, but it becomes clear that the datasets are fairly
different (Table 2).


4.3 Comparison of the Systems Performance on Ten Common 
Action Categories from UCF Versus HMDB 

In order to obtain a baseline for the performance on HMDB relative to
other databases such as UCF50, we identified ten categories that were similar or
identical in both datasets (UCF50/HMDB51): basketball/shoot_ball, biking/ride_bike,
diving/dive, fencing/stab, golf swing/golf, horse riding/ride_horse, pull ups/pullup,
push ups/pushup, rock climbing indoor/climb and walking with dog/walk. We
evaluated the performance of the ST-BoF recognition system with the original clips
as well as with the stabilized ones on both datasets. On the HMDB the mean
classification rate for the ten categories is 54.3 % (57.3 % for motion-stabilized
clips), whereas the classification rate on UCF50 for the same categories is
66.3 % (68.7 % for stabilized clips). These results suggest that the HMDB is a more
challenging and perhaps richer dataset.
Fig. 2 Confusion matrix for the ST-BoF system on the stabilized clips

Table 2 Average performance for the two benchmark systems used here (Laptev et al. [10] with
k = 5,000 clusters and Jhuang et al. [7]) on the HMDB and UCF50 datasets

Databases (orig./stab.)    Laptev et al. (%)    Jhuang et al. (%)
HMDB51                     20.4/21.9            22.8/23.1
UCF50                      45.4/49.3            43/-


4.4 Robustness of the Two Systems 

In order to assess the relative strengths and weaknesses of the two benchmark
systems, we broke down their performance in terms of: (1) the quality of the video 
clips, (2) occlusions, (3) camera position and (4) camera motion. The results are 
summarized in Table 3. 


Camera Motion 

For the ST-BoF approach, 23.1 % of the clips without camera motion were classified
correctly, while the classification rate dropped to 17.6 % for clips with camera
motion. When tested on stabilized clips, the recognition rate for clips without motion
remained stable at 23.8 %, whereas the classification rate for the clips with camera
motion increased by 7.7 percentage points to 25.3 %. Camera motion thus does affect
the results of the ST-BoF system. Surprisingly, the performance of the C2 features by
Jhuang et al. was higher on the clips with camera motion than on the clips without
camera motion (25.8 vs. 17.4 %). As evident from the decrease in performance when
clips were stabilized for motion, it seems that the system is somehow able to pick up on
the camera motion, which on this dataset seems to be correlated with the category
label. This will require further investigation.
Table 3 Average performance for the two benchmark systems used here (Laptev et al. [10] with
k = 5,000 clusters and Jhuang et al. [7] with n = 2,000 C2 features) analyzed by their robustness
to various sources of video degradations

Camera motion
HMDB (orig./stab.)    Without camera motion (%)    With camera motion (%)
Laptev et al.         23.1/23.8                    17.6/25.3
Jhuang et al.         17.4/22.9                    25.8/21.6

Clip quality
HMDB (orig./stab.)    Good (%)     Medium (%)    Fair (%)
Laptev et al.         32.0/26.3    16.4/24.7     19.3/22.8
Jhuang et al.         21.9/23.8    22.9/23.0     21.6/16.4

Camera position
HMDB (orig./stab.)    Front (%)    Back (%)     Left (%)     Right (%)
Laptev et al.         20.6/20.9    16.8/20.7    16.5/30.1    24.9/30.3
Jhuang et al.         19.2/20.4    20.1/23.3    26.8/23.1    27.6/23.4

Visible body part
HMDB (orig./stab.)    Full body (%)    Upper body (%)    Lower body (%)    Head (%)
Laptev et al.         17.8/24.8        22.7/27.2         27.3/27.3         21.1/19.1
Jhuang et al.         24.6/20.3        20.7/25.8         46.7/21.4         16.7/21.0


Clip Quality 

Here again we found a surprising result. While the performance of the ST-BoF 
approach seems to be influenced by the quality of the clips, the system by Jhuang 
et al. seems almost completely invariant to degradations in the quality of the videos. 
While the two systems achieve comparable rates on the medium and fair quality 
clips, the ST-BoF seems to perform significantly better on the good quality clips 
(on the original clips the performance of the ST-BoF vs. Jhuang et al. approach is 
32.0 vs. 21.9 %). A further evaluation of the dataset shows that in the subset of good
quality clips, 1/3 of the clips contain camera motion, against 2/3 for the medium
quality and 3/4 for the fair quality clips. This suggests that the evaluation of the
effect of clip quality on system performance is most likely contaminated by the presence
of camera motion. We feel that this confound will likely start to decrease as the
database grows in size.


Camera Position 

With the exception of the lower body subset (with only 0.9 % of the total clips,
results on this subset are not significant), it seems that the position of the camera in
the video does not influence the performance of the two systems significantly.
Table 4 Average performance for the two benchmark systems used here (Laptev et al. [10] and
Jhuang et al. [7]) on the HMDB and UCF50 datasets

HMDB51    HOGHOF (%)    HOF (%)    HOG (%)
orig      19.7          16.3       15.9
stab      24.7          24.6       17.1


Visibility of Different Body Parts 

For the evaluation of the different body parts, it can be said that the recognition
is robust for clips with camera motion as well as for those without. Only the recognition
of clips showing the full body of the actor can be improved. This holds as well for the
C2 features. The increase in recognition for clips where only the lower body is
visible shows that the algorithm can also deal with uncommon types of clips, as this
property holds only for a small fraction of clips and is less represented in the
training and test sets than the other variations.


4.5 Shape Versus Motion Information 

Table 4 shows a comparison between the performance of the original ST-BoF system
(using the HOGHOF descriptor) as well as the contribution of shape alone (HOG) 
and motion alone (HOF) cues. 


5 Conclusion 

We have described an effort to advance the field of action recognition with the design
of what is to our knowledge currently the largest action recognition database. With
currently 51 action categories and a little under 7,000 video clips, the proposed
database is still far from capturing the richness and the full complexity of video
clips commonly found on the internet. However, given the level of performance of
representative state-of-the-art computer vision algorithms (i.e., about 25 % correct
classification with chance level at 2 %), this initial database is arguably a good place
to start (performance on the CalTech-101 database for object recognition started
around 16 % [8]). Furthermore, our exhaustive evaluation of these two systems
suggests that performance is not significantly affected over a range of factors such
as camera position and motion as well as occlusions. This suggests that current
methods are fairly robust with respect to these low-level video degradations but
remain limited in their representative power to capture the complexity of
human actions.
to Gyrotron Resonator Simulations
and M. Auweter-Kurtz


Abstract Growing computational capabilities and simulation tools based on high- 
order methods allow the simulation of complex shaped plasma devices including 
the entire nonlinear dynamics of the Maxwell-Vlasov system. Such simulations 
model the particle-field-interactions of a non-neutral plasma without significant 
simplifications. Thereby, new insights into physics on a level of detail that has never 
been available before provide new design implications and a better understanding of 
the overall physics. We present a high-order discontinuous Galerkin method based 
Particle-In-Cell code for unstructured grids in a parallelization framework allowing 
for large scale applications on HPC clusters. We simulate the geometrically complex 
resonant cavity of the 170 GHz gyrotron aimed for plasma resonance heating of the 
fusion reactor ITER and we demonstrate that a highly efficient parallelization is a 
crucial requirement to simulate such a complex large-scale device. 
1 Introduction

For the numerical simulation of highly rarefied plasma flows, a fully kinetic modeling
of the Boltzmann equation completed by the Maxwell equations is necessary.
In the present report, we neglect collisional effects and focus our attention on the
Maxwell-Vlasov equations allowing the self-consistent investigation of collective
plasma phenomena. The numerical methods to tackle this non-linear problem in
six-dimensional phase space are very briefly reviewed in Sect. 2. In Sect. 3 we present
the highly scalable parallelization algorithm followed by Sect. 4 showing the results 
obtained from the scaling experiments and the gyrotron resonator simulations. 
Finally, a short outlook on further activities is given in Sect. 5. 


2 Numerical Framework 

A powerful method to numerically treat the non-linear Maxwell-Vlasov problem in
six-dimensional phase space is the PIC approach [1, 5], which has a long history
of more than five decades. The peculiarity of the PIC method is the ingenious
particle-mesh technique for the coupling of an Eulerian grid-based model for the
Maxwell equations with a Lagrangian approach for the Vlasov equation. To get an
overview of the numerical methods applied within the PIC framework, a single PIC
cycle, schematically depicted in Fig. 1, is discussed. The rarefied non-neutral plasma
inside a device is represented by a sample of charged simulation particles of, in
general, different species. In each time step, the electromagnetic fields are obtained
by the numerical solution of the Maxwell equations. In the context of the present
PIC solver, a discontinuous Galerkin (DG) method for these equations is applied
where, in addition, a hyperbolic divergence cleaning technique [9] is considered.
Especially the use of a powerful mixed nodal and modal DG approach [3, 4] allows
a fast and high-order space discretization. Afterwards, the electromagnetic fields
are interpolated to the actual spatial positions of the simulation particles [11].
These charged particles experience a force and thus an acceleration due to the
electromagnetic fields. According to the Lorentz force, the charges are advanced
and the new phase space coordinates are determined by numerically solving the
usual laws of Newtonian dynamics. For this purpose we use an explicit low-storage
fourth-order Runge-Kutta approach (LSERK4) [7]. To close the chain of
self-consistent interaction, the simulation particles have to be located with respect to
the computational grid in order to assign the contribution of each charged particle
to the changed charge and current densities [11], which are then the sources for the
Maxwell equations in the subsequent time step.
Fig. 1 Standard PIC concept
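
The particle-push part of the cycle can be illustrated with a small sketch. It makes several simplifying assumptions that differ from the solver described in the text: the fields are prescribed and uniform instead of being obtained from a Maxwell solve and interpolated, the motion is non-relativistic, and a classical fourth-order Runge-Kutta step is used in place of the low-storage LSERK4 scheme. Units are normalized and all names are hypothetical.

```python
import numpy as np


def lorentz_rhs(state, charge, mass, e_field, b_field):
    """Time derivative of (position, velocity) under the Lorentz force."""
    velocity = state[3:]
    accel = (charge / mass) * (e_field + np.cross(velocity, b_field))
    return np.concatenate([velocity, accel])


def rk4_step(state, dt, charge, mass, e_field, b_field):
    """One classical fourth-order Runge-Kutta step (not the low-storage variant)."""
    k1 = lorentz_rhs(state, charge, mass, e_field, b_field)
    k2 = lorentz_rhs(state + 0.5 * dt * k1, charge, mass, e_field, b_field)
    k3 = lorentz_rhs(state + 0.5 * dt * k2, charge, mass, e_field, b_field)
    k4 = lorentz_rhs(state + dt * k3, charge, mass, e_field, b_field)
    return state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


# Normalized toy setup: q = m = 1, B = 1 along z, no electric field.
charge, mass = 1.0, 1.0
e_field = np.zeros(3)
b_field = np.array([0.0, 0.0, 1.0])
state = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # x, y, z, vx, vy, vz

steps = 1000
dt = 2.0 * np.pi / steps  # one gyration period is 2*pi*m/(|q|*B) = 2*pi here
start = state.copy()
for _ in range(steps):
    state = rk4_step(state, dt, charge, mass, e_field, b_field)

speed = float(np.linalg.norm(state[3:]))
```

In a static magnetic field the particle gyrates on a circle, so after one full period it should return to its starting point with unchanged speed, which gives a simple check on the integrator.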


3 Highly Scalable Parallelization of the PIC Method 

Highly parallel PIC simulations are often handicapped by the fact that in many 
simulated problems, the particles are locally confined. This causes load balancing 
problems if a simple domain decomposition approach is used for the parallelization. 
The presented load balancing algorithm is based upon the concept that on each MPI 
domain, the two different solvers (i.e. the particle and the grid solver) each require 
different computation times and that these computation times can be traded to some 
extent. On a domain with lots of particles, the particle solver might require a much 
longer time than the grid solver. On a different domain, the work load might be equal 
while on yet another domain there might be no particles at all, so the particle solver 
does not need any computation time. And yet, if the load balancing is chosen in such 
a way that computation times of Maxwell and particle methods on each domain sum 
up to an average that is about the same on all domains, the computational load is 
balanced across MPI domains and the parallel computation is still efficient. 

However, the efficient load balancing relies on two assumptions: First, that an 
ideal average computational load can be identified, and second, that the computation 
can be distributed in such a way that all MPI domains receive a computational load 
close to the average. The following paragraphs briefly investigate whether these two 
assumptions are justified. 

Identifying an optimal average computational load. The domain decomposition 
assigns elements and the particles that are located in these elements to the available 
MPI processes. Pre-determined weights predict the computational load for each 
element and for each particle. The decomposition is performed with the goal that 
the sum of the weights of all elements and particles on each MPI domain equals the 
average of the sum of the weights of all particles and elements in the simulation. 

A DG element is weighted by its number of interpolation points. In order to find 
a suitable prediction for the computational load of a particle, a parameter study was 
conducted for the weighting factor of the particles versus a DG element. On 1,024 
MPI processes, the weighting factor of a single particle vs. a single element was 
varied in order to find the best particle-element ratio. The results of the study, as 
depicted in Fig. 2, show that a ratio w of particle weight to element weight of about 
w ≈ 0.003 yields the most efficient load balancing for the fourth-order test simulation 
of the gyrotron resonator. 
It should be noted that this ratio of w ≈ 0.003 is an optimum for that specific 
simulation on 1,024 MPI processes. For simulations with different numbers of MPI 
processes, the ratio can vary slightly. For different computational grids, the ratio 
will probably vary a bit more. It will vary even more for different types of plasma 
devices. 

Fig. 2 A parameter study with a variation of particle weights shows that a ratio of 
about w ≈ 0.003 yields the most efficient load balancing for 1,024 MPI processes 
(horizontal axis: normalized particle weight w [-], from 0 to 0.01) 

Fig. 3 For different grids, the maximum deviation from the average (assumed-to-be 
ideal) weight is plotted over the number of MPI processes. The upper curves show 
simulations without load balancing while the lower curves show results from 
simulations with load balancing. The deviations are normalized by the average 
weight for each process 
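
The weighting idea can be sketched in a few lines. The sketch assumes that the elements are already ordered along some one-dimensional curve, so that each MPI domain is a contiguous chunk of that ordering (the text does not say how the decomposition is actually traversed), that every element has the normalized weight 1, and that each particle adds the weight w. The function names and the 40-element example are hypothetical.

```python
def element_loads(particles_per_element, w=0.003):
    """Load of each element: 1 for the element plus w per contained particle."""
    return [1.0 + w * n for n in particles_per_element]


def partition_by_load(loads, n_procs):
    """Cut an ordered element list into n_procs contiguous chunks of similar load.

    Returns a list of (start, stop) index pairs, one per MPI domain.
    """
    total = sum(loads)
    bounds = [0]
    acc = 0.0
    for i, load in enumerate(loads):
        acc += load
        # Place the next cut once the running load reaches the next multiple
        # of the ideal per-domain load total / n_procs.
        while len(bounds) < n_procs and acc >= total * len(bounds) / n_procs:
            bounds.append(i + 1)
    bounds.append(len(loads))
    return list(zip(bounds[:-1], bounds[1:]))


def chunk_loads(loads, chunks):
    """Summed load of every (start, stop) chunk."""
    return [sum(loads[a:b]) for a, b in chunks]


# Hypothetical example: 40 elements, 8 of which hold 1,000 particles each.
particles = [0] * 16 + [1000] * 8 + [0] * 16
loads = element_loads(particles)
balanced = partition_by_load(loads, 4)
equal_count = [(0, 10), (10, 20), (20, 30), (30, 40)]  # no load balancing
```

In this toy setup the load balanced chunks that contain the particles hold far fewer elements than the particle-free chunks, and the largest chunk load is smaller than for the equal-count split.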

Distribution of the computational weight: Deviation from the average. In order 
to characterize the quality of the load balancing, several simulations were inspected 
with regard to their distribution of computational load. The total weight w_total of 
all elements and particles in the computational domain was calculated using the 
ratio w mentioned above. An ideal weight for an MPI domain was then identified 
as w_ideal = w_total / n_procs with the number of MPI domains n_procs. The total 
weight w_i of all particles and elements on each MPI domain i_proc was then 
compared to w_ideal, and 
the maximum deviation is plotted in Fig. 3 for each simulation. As is to be expected, 
the deviation increases for increasing n_procs because the smallest delta of 
computational load that can be shifted from one domain to another is the load of a 
single element and the number of elements on each MPI domain decreases as n_procs 
increases. Likewise, the deviation increases with a decreasing number of grid cells 
in a computational mesh. From Fig. 3 it is also evident that the deviations of all load 
balanced simulations remain lower than the deviations of any non-load balanced 
simulation. 

Fig. 4 Domain decomposition for varying MPI process numbers. (a) Load balanced 
decomposition into 32 MPI domains, (b) Load balanced decomposition into 2,048 MPI 
domains, (c) Decomposition into 2,048 MPI domains without load balancing. All 
domains contain between 98 and 99 elements, (d) Load balanced decomposition into 
2,048 MPI domains. Domains with more than 140 elements are blanked 
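
The granularity argument can be made concrete with a small sketch. It ignores the particles altogether and spreads equally weighted elements as evenly as possible, which is a simplifying assumption and not the decomposition used in the paper. The element count of 201,720 is the grid size quoted in Sect. 4.1, and the function names are hypothetical.

```python
def max_relative_deviation(domain_weights):
    """Largest |w_i - w_ideal| / w_ideal with w_ideal = sum(w_i) / n_procs."""
    ideal = sum(domain_weights) / len(domain_weights)
    return max(abs(w - ideal) for w in domain_weights) / ideal


def even_element_split(n_elements, n_procs):
    """Element counts per domain when elements are spread as evenly as possible."""
    base, extra = divmod(n_elements, n_procs)
    return [base + 1] * extra + [base] * (n_procs - extra)


N_ELEMENTS = 201_720  # hexahedral grid cells quoted in Sect. 4.1

# Elements only, no particles: the deviation is caused purely by the fact
# that a single element is the smallest unit that can be moved.
deviations = {
    p: max_relative_deviation(even_element_split(N_ELEMENTS, p))
    for p in (32, 2048, 16384)
}
```

For 2,048 domains the split consists of domains with 98 or 99 elements, and the relative deviation grows as the number of domains increases from 32 to 16,384, because one element becomes a larger fraction of the ideal domain weight.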

It is thus to be concluded that both assumptions are justified. In the following, the 
effects of the balancing of the computational load on the number of grid cells in each 
MPI domain are inspected. Visualizations of the domains resulting from the load 
balancing algorithm are shown in Fig. 4 for the resonant cavity simulation presented 
in Sect. 4. The color of each domain reflects the number of elements it contains. 
It is evident that the outer domains generally contain more elements than the inner 
domains. This is because the particles are confined to the inner region close to the 
coaxial insert (a detailed description is given in Sect. 4.1, especially Fig. 8 ). The 
more particles a domain contains, the smaller it will be in terms of the number of grid 
cells in order to keep the total computational load approximately uniform across all 
MPI domains. In order to contrast the load balanced decomposition, Fig. 4c shows 
a domain decomposition without load balancing. Here, the number of grid cells 
is approximately uniform across all MPI domains: All domains contain 98 or 99 
elements. It is clear, however, that in this decomposition, the Maxwell-only domains 
of the outer region will have to wait for one or more inner domains which have an 
additional computational load due to the particles of the electron beam. For the 
load balanced decomposition, these inner domains are highlighted in Fig. 4d where 
domains with more than 140 elements are blanked. The significant effect of the load 
balancing is evident because here some domains include only about 20 elements. 
Of course, this low number of elements results from the large number of particles 
contained in these domains. 


4 Gyrotron Simulation 

Gyrotrons are high-power milli- and micrometer wave generators. The gyrotron 
presented here is used in the context of fusion plasma heating, i.e. electron cyclotron 
resonance heating. A detailed overview of state-of-the-art gyrotron research can 
be found in Ref. [13]. This section deals with the simulation of the 170 GHz resonant 
cavity which excites the TE 34,19 mode (as shown for an x-y-slice in Fig. 5). The 
TE 34,19 wave mode is generated by gyrating electrons and their interaction with their 
self-generated electromagnetic fields in what is known as the electron cyclotron 
maser instability. This interaction takes place in a cylindrical part of the gyrotron 
known as the resonant cavity or resonator. The radius of this cylindrical part has to 
be chosen depending on the desired wave mode. The gyrating electrons will then 
emit waves that are reflected from the walls and stimulate an azimuthal bunching 
of the electrons in a desired phase on the gyro-circle. The bunching amplifies the 
emitted electromagnetic waves at the desired frequency, due to a resonance effect 
between the gyro-frequency and the radio-frequency field. This process is described 
in detail by Kern [8] and Illy [6]. Simulations of a TE 2,3 resonant cavity using the 
simulation code presented here were introduced by Stock et al. [12]. Clearly, the 
excitation of the wave modes relies on the self-consistent particle-field interaction. 
Therefore, the resonant cavity was considered a suitable test case to demonstrate 
the capabilities of the coupled particle and Maxwell solver. Since the simulation 
deals with high frequency wave generation at small wavelengths, the computational 
demand as well as the requirements with respect to memory are enormous. Thus, 
it is also a good test case for the efficiency of the parallelization. This is the case 
especially because the particle distribution is not homogeneous but all particles are 
confined to the small fraction of the volume that is occupied by the electron beam. 
Fig. 5 The E_⊥ component of the TE 34,19 mode shows the 34 wavelengths in 
circumferential direction and the 19 wavelengths across the diagonal that form the 
indices of the TE wave mode 



This accumulation of particles in a small part of the domain is a challenge to the 
balancing of the computational load. While one process might compute a region of 
the domain outside the beam which does not include particles at all, another region 
may include a large portion of the beam with a large number of particles inside. 

The simulation of the resonant cavity is detailed in the following three sections, 
beginning with the description of the simulation setup and the computational grid 
in Sect. 4.1, followed by a presentation and discussion of the simulation results in 
Sect. 4.2, and concluded by the discussion of scalability aspects in Sect. 4.3. 


4.1 Setup 

The geometry for the presented simulation was taken from Piosczyk et al. [10]. 
During the startup of the experiments presented in Ref. [10], various wave modes 
are temporarily excited as the voltage is continually increased. However, we do 
not simulate the complete startup of the gyrotron since the establishment of the 
TE 34,19 mode can be reached much more quickly numerically than experimentally. 
Therefore, a different operating point was chosen for the numerical simulation than 
the working point presented in [10]. Preliminary simulations with the established 
cavity simulation tool SELFT [8] have shown that with the chosen parameter set, 
a TE 33,19 mode develops first, followed by the excitation and final establishment of 
the TE 34,19 mode. 

The simulated coaxial cavity is a cylindrical tube with radius r_a = 29.55 mm and 
length Δz_r = 16 mm. Below the resonator (z < 22 mm), the outer wall is down-tapered 
at an angle of β_1 = 3° while above the resonator (z > 38 mm), the outer 
wall is up-tapered at an angle of β_2 = 2.5° as depicted in Fig. 6. 

Inside the cavity, there is an insert, down-tapered at an angle of β_i = 1° with a 
radius of r_i = 7.86 mm at z = 30 mm. In the outer surface of the insert, there are 
75 rectangular corrugations (see Fig. 6). 

Boundary Conditions: Both the corrugated insert and the outer wall are modeled 
by a perfectly conducting boundary. Entry and exit at z_0 = 0 mm and z_1 = 60 mm, 
respectively, are modeled by non-reflecting boundaries. For the particles, entry and 
exit are open boundaries. Based on a current of I = 65 A, n_Part,e = 405,698 particles 
with an MPF¹ of 10^6 are inserted per nanosecond at the entry at z_0 = 0 mm. These 
particles are distributed in space on a circular beam with radius r_b = 10.0 mm and 
length Δz_e = v_∥ · Δt. The particles gyrate around the externally applied magnetic 
field of B_z = 6.88 T with a Larmor radius of r_g = 0.108 mm (see Fig. 7). The 
angular positions of the inserted particles on the beam as well as the phases of their 
gyration around B_z are chosen at random with a uniform distribution. Their initial 
velocity of 1.52 × 10^8 m/s is determined from the acceleration due to an applied 
potential of U_c = 82.2 kV. The velocity is directed tangentially along a gyrating 
path around B_z. The ratio of the circumferential velocity v_⊥ around B_z to 
the velocity in z-direction v_∥ is given by the factor α = v_⊥ / v_∥, which is fixed here to 
α = 1.1. 

Initial Conditions: The field solver is initialized with zero for the electromagnetic 
fields. The external magnetic field of B_z = 6.88 T is applied to the particles only 
and is not included in the Maxwell simulation. The particles are initialized with the 
same beam parameters that are used for the emission. However, for the initialization, 
the beam is created throughout the whole domain with Δz = 60 mm. The initial 
number of particles is n_Part,0 = n_Part,e · Δz / v_∥. 
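
The quoted beam parameters can be cross-checked with a short calculation. The sketch assumes that the electrons start from rest and gain the full potential energy of the applied voltage, uses the relativistic energy relation, and takes CODATA constants; the function name beam_quantities and its dictionary keys are hypothetical.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge (C)
M_E = 9.1093837015e-31       # electron rest mass (kg)
C_LIGHT = 299792458.0        # speed of light (m/s)


def beam_quantities(u_volts, alpha, b_tesla, n_per_ns, mpf, dz_m):
    """Derive beam velocity, Larmor radius, current and initial particle count."""
    # Lorentz factor from the kinetic energy e*U gained in the potential.
    gamma = 1.0 + E_CHARGE * u_volts / (M_E * C_LIGHT**2)
    v = C_LIGHT * math.sqrt(1.0 - 1.0 / gamma**2)
    # Split v into parallel and perpendicular parts with alpha = v_perp / v_par.
    v_par = v / math.sqrt(1.0 + alpha**2)
    v_perp = alpha * v_par
    # Relativistic Larmor radius in the axial magnetic field.
    r_g = gamma * M_E * v_perp / (E_CHARGE * b_tesla)
    # Current carried by n_per_ns macro particles of mpf electrons each.
    current = n_per_ns * mpf * E_CHARGE / 1e-9
    # Macro particles needed to fill a beam of length dz_m at emission rate.
    n_init = n_per_ns * (dz_m / v_par) / 1e-9
    return {"gamma": gamma, "v": v, "v_par": v_par, "v_perp": v_perp,
            "r_g": r_g, "current": current, "n_init": n_init}


q = beam_quantities(u_volts=82.2e3, alpha=1.1, b_tesla=6.88,
                    n_per_ns=405_698, mpf=1e6, dz_m=0.060)
```

Under these assumptions the sketch gives a velocity close to 1.52 × 10^8 m/s, a Larmor radius close to 0.108 mm, a current close to 65 A and an initial particle number close to 237,668, in line with the values quoted in the text and in Table 1. The Lorentz factor obtained from the voltage alone is about 1.16, which differs slightly from the tabulated γ = 1.176.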

Table 1 summarizes the relevant parameters for the resonator simulation. Note 
that for the scalability studies shown in Sect. 4.3, different computational parameters 
were used. This is due to the fact that the scalability analysis was done while a working 
setup for the simulation had not yet been found. For all resonator simulations 
presented in this work, Cell Mean Value particle-grid coupling was used [11]. 

¹ Macro particle factor (MPF), i.e. the number of real particles per simulated particle. 

Fig. 7 Particle parameters for the electron beam 

Table 1 Parameters for the 170 GHz resonant cavity simulation 

Symbol     (Unit)   Value        Symbol     (Unit)   Value 

Geometry of the boundaries       Computational parameters 
z_0        (mm)     0.0          MPF        (-)      10^6 
z_1        (mm)     60.0         n_Part,0   (-)      237,668 
Δz_r       (mm)     16.0         n_Part,e   (1/ns)   405,698 
r_a        (mm)     29.55        O          (-)      5-6 
r_i        (mm)     7.86 

Technical parameters             Computational parameters for the scalability study 
U_c        (kV)     82.2         MPF        (-)      10^4 
I          (A)      65.0         n_Part,0   (-)      29,191,600 
B_z        (T)      6.88         n_Part,e   (1/ns)   46,811,300 
                                 O          (-)      4 

Beam parameters 
r_b        (mm)     10.0 
r_g        (mm)     0.108 
α          (-)      1.1 
γ          (-)      1.176 

Computational Grids: Due to their small geometric extensions in radial and 
circumferential direction (0.44 and 0.35 mm, respectively), the corrugations in the 
insert dictate the minimum size of grid cells (and thus also the time step). The most 
natural way to discretize these corrugations is with hexahedral elements. The cavity 
itself was also meshed using hexahedra. Building upon the experiences from the 
TE 34,19 waveguide and launcher simulations, an edge length of Δx ≈ 1 mm was 
chosen for the cavity, yielding a total of 201,720 hexahedral grid cells. 

The simulation was started with a spatial convergence order of 5. In the course 
of the simulation, however, it was realized that the polynomial degree of the basis 
functions in the fifth-order simulation was too small, resulting in a filtering of the 
higher wave modes. Therefore, the spatial order of the scheme was changed during 
the simulation from O5 to O6 at 11 ns, which yielded the expected results. 
Fig. 8 Compute grid for coupled simulation of the TE 34,19 resonant cavity. (a) Complete view 
of the computational domain, (b) Detailed view on the computational domain with the zone 
containing particles marked blue 


The computational grid is depicted in Fig. 8. Figure 8a shows a cut-view of the 
mesh while Fig. 8b provides a detailed view of the corrugated insert, also showing 
in blue the grid cells containing particles. 


4.2 Resonator Simulation Results 

In the presented simulation, the TE 34,19 mode was established after about 33 ns as shown 
in Fig. 9 for t = 36 ns. Figure 9 shows that the TE 34,19 wave mode is indeed in a stable 
state: The regular pattern of the TE 34,19 mode dominates the domain, most notably 
the resonator itself and the up-tapered region above the resonator. 

A result of more general interest in the resonator simulation is the computation 
time required per nanosecond. A sixth-order simulation on 512 CPU cores requires 
a wall-clock time of 16,100 s per nanosecond of simulated time on an Intel® Xeon 
X5570 (Nehalem) cluster with Infiniband network. 


4.3 Scaling Experiments 

In the presented 170 GHz resonator simulation, particle and Maxwell computation 
each require a significant part of the total computation time. Moreover, particle 
computation is confined to the region of the electron beam (see Fig. 8b). Even on 
small numbers of MPI domains, there are domains with more than 50 % particle 
computation and, at the same time, domains with 100 % Maxwell computation. 
Therefore, efficient load balancing is the key to any efficient parallel computation 
of the problem. 

Fig. 9 The wave mode at t = 36.0 ns. (a) B_z on an x-y-slice at z = 30 mm. (b) B_z on 
an x-z-slice at y = 0 mm. (c) B_z on three slices 

Fig. 10 Scalability curves resonator (strong scaling). (a) Simulation time resonator, (b) Speedup 
resonator, (c) Parallel efficiency resonator, (d) Speedup due to load balancing 

Measurements were taken as pure computation times, i.e. they represent the 
wall clock time, not including initialization or I/O. The reason for this is that the 
simulation times can then be normalized as simulation time per time step. This 
normalized time provides a more general quantity, applicable for simulation runs 
of different length. Moreover, in a typical 170 GHz gyrotron resonator simulation, 
initialization, I/O and other computations at analysis time levels require less than 
1 % of the total wall clock time. 

Furthermore, it should be noted that the scalability studies with the 170 GHz 
gyrotron resonator were performed when a working parameter set for the successful 
simulation had not been found yet. Apart from small changes in the physical 
parameters, the computationally most notable differences are the MPF of 10^4 and 
the spatial order of the Maxwell scheme of O4 which were used for the scalability 
studies presented below. We note, therefore, that the computational load in the 
scalability studies is shifted more towards the particle algorithms. In other words, in 
the simulation presented in Sect. 4.2 with the parameters summarized in Table 1, 
the Maxwell solver requires more computation time while the particle methods 
require less computation time compared to the simulations of the scalability studies. 
We emphasize that this does not affect the relevance of the results and their 
interpretations. Rather, the scalability simulations were performed with a more 
challenging setup with respect to the load balancing than actually required by the 
simulation presented in Sect. 4.2. 

The performance of the parallelization is best shown in the strong scaling study 
depicted in Fig. 10. The study was conducted on the mesh described in Sect. 4.1 
with spatial order 4. Serving as reference, one strong scaling study was done using 
a compilation without the particle algorithms, here referred to as “Maxwell only.” 
PIC simulations were done both with and without load balancing. 

The Maxwell only simulation scales almost perfectly from 1 to 16,384 MPI 
processes. The PIC simulations scale up to 16,384 processes but the decline in 
the speedup shows that a significant scalability beyond 16,384 MPI processes is 
not to be expected. The reason is that here, some MPI domains include one single 
element only. Thus, the computational load cannot be further distributed or balanced 
without a hybrid parallelization. Further speedups can most likely be achieved if 
an additional shared memory parallelization is used. Over a wide range of MPI 
process numbers, the load balanced PIC scheme is significantly faster than the 
scheme without load balancing. In the range from 1,024 to 8,192 processes, the 
load balanced simulations are more than twice as fast as the non-load balanced 
computations (as shown in Fig. 10d). 
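
The quantities shown in the four panels of Fig. 10 follow from the measured time per time step by simple ratios. The sketch below uses the standard definitions of strong-scaling speedup and parallel efficiency; the function names are hypothetical and the timings in the example are invented purely for illustration, they are not measurements from this study.

```python
def speedup(t_ref, t):
    """Strong-scaling speedup relative to a reference run of the same problem."""
    return t_ref / t


def parallel_efficiency(t_ref, p_ref, t, p):
    """Fraction of ideal speedup reached when going from p_ref to p processes."""
    return (t_ref * p_ref) / (t * p)


def load_balancing_gain(t_unbalanced, t_balanced):
    """Speedup due to load balancing at a fixed number of processes."""
    return t_unbalanced / t_balanced


# Invented timings (seconds per time step), for illustration only.
t_serial = 1024.0          # 1 process
t_balanced_1024 = 1.25     # 1,024 processes, with load balancing
t_unbalanced_1024 = 3.0    # 1,024 processes, without load balancing

s = speedup(t_serial, t_balanced_1024)
eff = parallel_efficiency(t_serial, 1, t_balanced_1024, 1024)
gain = load_balancing_gain(t_unbalanced_1024, t_balanced_1024)
```

A gain above 2 in this sense corresponds to the statement that the load balanced simulations are more than twice as fast as the non-load balanced ones.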


5 Conclusions and Outlook 

We demonstrated the ability of our PIC code to simulate a 170 GHz gyrotron 
resonator with the currently available HPC capabilities. Since the 170 GHz gyrotron 
plays a crucial role in the development of future fusion reactors like ITER [13] 
we expect a high relevance of our work for the research in the field of gyrotron 
design. An efficient parallelization approach is crucial for achieving any results for 
a problem of the size presented here. Without the presented load balancing approach, 
a large share of the allocated HPC resources would be wasted on idle processes. An 
efficient parallelization of the PIC method is therefore an essential feature for 
problems of this size and has to be a central focus of development and research. We 
believe that computations like the presented resonator simulation are required if open 
problems like after-cavity interactions or beam misalignment are to be investigated. 
In future work we expect to test a recent theory which questions the feasibility of 
coaxial super power gyrotrons [2], concluding that maximal values of the azimuthal 
index exist, beyond which stationary single mode operation of a gyrotron is not 
possible, due to the onset of stochastic oscillations of the radio-frequency fields. 
This issue definitely needs deeper investigation with a full-wave time-domain 
code like the presented method. Both tasks of this challenging project are of great 
importance in the field of gyrotron development. 